Statistical Machine Translation

Cambridge University

Recent News


Summer 2014

  • Matic Horvat will spend the summer at the USC Information Sciences Institute, as an Intern in the Natural Language Group
  • http://nlg.isi.edu/jobs.html


February 2014 - EACL'14 Student Research Workshop


December 2013 - Paper on Word Ordering Accepted to EACL 2014

  • Word Ordering with Phrase-Based Grammars
  • Adrià de Gispert, Marcus Tomalin, Bill Byrne
    • We describe an approach to word ordering using modelling techniques from statistical machine translation. The system incorporates a phrase-based model of string generation that aims to take unordered bags of words and produce fluent, grammatical sentences. We describe the generation grammars and introduce parsing procedures that address the computational complexity of generation under permutation of phrases. Against the best previous results reported on this task, obtained using syntax driven models, we report huge quality improvements, with BLEU score gains of 20+ which we confirm with human fluency judgements. Our system incorporates dependency language models, large n-gram language models, and minimum Bayes risk decoding.
    • http://mi.eng.cam.ac.uk/~wjb31/ppubs/EACL2014.pdf

December 2013 - Collaboration with Michael Riley and Cyril Allauzen of Google Research to appear in Computational Linguistics

  • Pushdown Automata in Statistical Machine Translation.
  • C. Allauzen, B. Byrne, A. de Gispert, G. Iglesias, M. Riley.
  • Computational Linguistics. To appear.
    • This paper describes the use of pushdown automata (PDA) in the context of statistical machine translation and alignment under a synchronous context-free grammar. We use PDAs to compactly represent the space of candidate translations generated by the grammar when applied to an input sentence. General-purpose PDA algorithms for replacement, composition, shortest path, and expansion are presented. We describe HiPDT, a hierarchical phrase-based decoder using the PDA representation and these algorithms. We contrast the complexity of this decoder with a decoder based on a finite state automata (FSA) representation, showing that PDAs provide a more suitable framework to achieve exact decoding for larger SCFGs and smaller language models. We assess this experimentally on a large-scale Chinese-to-English alignment and translation task. In translation, we propose a two-pass decoding strategy involving a weaker language model in the first-pass to address the results of PDA complexity analysis. We study in depth the experimental conditions and tradeoffs in which HiPDT can achieve state-of-the-art performance for large-scale SMT.
    • http://mi.eng.cam.ac.uk/~wjb31/ppubs/cl2013.final.pdf

October 2013 - Matic Horvat starts as a PhD student on SMT

  • Matic will work on Semantics in SMT, jointly supervised by Ann Copestake and Bill Byrne

2013-2014 Academic Visitors

  • Dr Tong Xiao from the Northeastern University, China is visiting Cambridge to work on syntax in Chinese-English machine translation.
  • Dr Anssi Yli-Jyrä from the University of Helsinki is visiting Cambridge as a Clare Hall Research Fellow to work on syntax and weighted finite state automata in translation.

August 2013 - WMT Presentation


July 2013 - International Conference on Finite-State Methods and Natural Language Processing -- FSMNLP 2013


June 2013 - Cognition Institute Summer School: Bilingual minds, bilingual machines


2013 Summer Students

  • Ed Hughes from the Department of Pure Mathematics and Mathematical Statistics (DPMMS) will work with Rory Waite on SMT system optimisation using techniques from tropical geometry. Ed will be sponsored by the DPMMS Post Master’s Consultancy Scheme.


May 2013 - New Russian-English SMT system


April 2013 FP7 FAUST Project Concludes Successfully -- faust-fp7.eu/faust/

  • Project receives a rating of 'Excellent progress (the project has fully achieved its objectives and technical goals and has even exceeded expectations)' in its final review in Luxembourg.


Excellent results in Chinese and Arabic Translation in the 2012 NIST OpenMT Evaluation


September 2012 PBML Paper on using HFiles for fast MT model access

  • Simple and Efficient Model Filtering in Statistical Machine Translation
  • Juan Pino, Aurelien Waite, William Byrne
  • The Prague Bulletin of Mathematical Linguistics No. 98, 2012, pp. 5–24.
  • http://ufal.mff.cuni.cz/pbml/98/art-pino-waite-byrne.pdf
    • Data availability and distributed computing techniques have allowed statistical machine translation (SMT) researchers to build larger models. However, decoders need to be able to retrieve information efficiently from these models to be able to translate an input sentence or a set of input sentences. We introduce an easy to implement and general purpose solution to tackle this problem: we store SMT models as a set of key-value pairs in an HFile. We apply this strategy to two specific tasks: test set hierarchical phrase-based rule filtering and n-gram count filtering for language model lattice rescoring. We compare our approach to alternative strategies and show that its trade offs in terms of speed, memory and simplicity are competitive.


August 2012 MT Journal article on posteriors as translation confidence measures

  • N-gram posterior probability confidence measures for statistical machine translation: an empirical study
  • Adrià de Gispert, Graeme Blackwood, Gonzalo Iglesias and William Byrne
  • Machine Translation Journal.
  • http://www.springerlink.com/content/748552rj128q8337
    • We report an empirical study of n-gram posterior probability confidence measures for statistical machine translation (SMT). We first describe an efficient and practical algorithm for rapidly computing n-gram posterior probabilities from large translation word lattices. These probabilities are shown to be a good predictor of whether or not the n-gram is found in human reference translations, motivating their use as a confidence measure for SMT. Comprehensive n-gram precision and word coverage measurements are presented for a variety of different language pairs, domains and conditions. We analyze the effect on reference precision of using single or multiple references, and compare the precision of posteriors computed from k-best lists to those computed over the full evidence space of the lattice. We also demonstrate improved confidence by combining multiple lattices in a multi-source translation framework.

July 2012 FSMNLP paper on links between LMERT and tropical polynomials

  • Lattice-based minimum error rate training using weighted finite-state transducers with tropical polynomial weights.
  • A. Waite, G. Blackwood, W. Byrne
  • 10th International Workshop on Finite State Methods and Natural Language Processing (FSMNLP 2012), Donostia-San Sebastian, Spain, July 2012.
    • Minimum Error Rate Training (MERT) is a method for training the parameters of a log-linear model. One advantage of this method of training is that it can use the large number of hypotheses encoded in a translation lattice as training data. We demonstrate that the MERT line optimisation can be modelled as computing the shortest distance in a weighted finite-state transducer using a tropical polynomial semiring.


July 2012 Rory Waite is spending the summer in Los Angeles as an intern at SDL


June 2012 EAMT 2012 Best Paper Award!

  • Can Automatic Post-Editing Make MT More Meaningful?
  • Kristen Parton, Nizar Habash, Kathleen McKeown, Gonzalo Iglesias, Adrià de Gispert


April 2012

Federico Flego joins the Delphi SMT project as an RA


EAMT 2012 paper with Columbia University

  • Can Automatic Post-Editing Make MT More Meaningful?
    • Kristen Parton, Nizar Habash, Kathleen McKeown, Gonzalo Iglesias, Adrià de Gispert
    • Automatic post-editors (APEs) enable the re-use of black box machine translation (MT) systems for a variety of tasks where different aspects of translation are important. In this paper, we describe APEs that target adequacy errors, a critical problem for tasks such as cross-lingual question-answering, and compare different approaches for post-editing: a rule-based system and a feedback approach that uses a computer in the loop to suggest improvements to the MT system. We test the APEs on two different MT systems and across two different genres. Human evaluation shows that the APEs significantly improve adequacy, regardless of approach, MT system or genre: 30-56% of the post-edited sentences have improved adequacy compared to the original MT.

March 2012 Marcus Tomalin seminar at University of Edinburgh

  • Marcus Tomalin gave a seminar titled `In Search of `Natural' Speech: Grammaticality, Acceptability, and Speech Technology' at the Edinburgh Linguistics Circle, March 2012
    • Although state-of-the-art large vocabulary Automatic Speech Recognition (ASR) and Statistical Machine Translation (SMT) systems often achieve impressive Word Error Rates (WERs) and BLEU scores respectively, end-users frequently consider the word sequences output by such systems to be `unnatural'. The perceived `unnaturalness' usually results from the accumulation of many small linguistic errors (e.g., lack of subject-verb agreement, partially scrambled syntax, homophonic substitution). Consequently, in recent years there has been a renewed interest in improving the `naturalness' of ASR and SMT output, even in systems that produce good WER and BLEU scores.

      In this talk, the perceived `naturalness' of ASR and SMT transcriptions will be considered in the context of on-going debates about grammaticality and acceptability. An experimental framework for exploring these aspects of ASR/SMT transcriptions is described, and a methodology for improving the `naturalness' of such outputs is presented. The simplest ways of modifying an input word sequence are insertion, permutation, deletion, and substitution, and the approach adopted in this work makes use of a Combinatory Categorial Grammar (CCG) text generation system which enables input word sequences to be modified so as to improve their `naturalness'. It is shown that the output produced by the CCG-based system is considerably improved if the N-best generated hypotheses are rescored and reranked using Ngram-based techniques.


Yue Zhang appointed to Assistant Professor

Dr. Yue Zhang will take up a position as Assistant Professor at Singapore University of Technology and Design with effect from July 2012. Yue has been a Research Associate at the Cambridge Computer Laboratory working on parsing and natural language generation for MT as part of the FAUST project.


February 2012 -- Article to appear in Speech Communication

  • Impacts of machine translation and speech synthesis on speech-to-speech translation
    Kei Hashimoto, Junichi Yamagishi, William Byrne, Simon King, Keiichi Tokuda


26--27 January 2012 -- Short Course on Weighted Finite State Transducers in Statistical Machine Translation


20 January 2012 -- Seminar


Postdoctoral Research Opportunities in SMT


Marcus Tomalin joins the FAUST project

  • Marcus will work on shallow generation for fluency in MT

2011 Visitors

October 3--9 -- Brian Roark, Oregon Health & Science University (OHSU).

October 10--11 -- Markos Mylonakis, University of Amsterdam.


September 2011

  • Graeme Blackwood (CUED PhD 2010) starts as Research Staff Member in the Machine Translation Group at the IBM T.J. Watson Research Center on the 3rd of October.
  • Cambridge Spanish-English and French-English interactive SMT systems are running on Reverso Labs
    Real-time SMT systems based on Cambridge's HiFST decoder.
    Try our systems and the other FAUST MT systems at http://labs.reverso.net


2011 Summer Internships

  • Juan Pino is in Mountain View, CA (USA) at Google, Inc. working on morphology in Russian MT
  • Matt Shannon is in London at Google, Inc. working on HMM-based speech synthesis


July 2011 -- Paper to be presented at the Conference on Empirical Methods in Natural Language Processing (EMNLP'11) -- joint work with Google Research

Hierarchical Phrase-based Translation Representations.
Gonzalo Iglesias, William Byrne, Adrià de Gispert, Department of Engineering, University of Cambridge
Cyril Allauzen, Michael Riley, Google Research


19 July 2011 -- Bill Byrne discusses interactive machine translation on the BBC World Service Radio Programme 'Click'

The FP7 FAUST project is featured in a discussion on openness on the internet

Listen: BBC Programme website, with audio

FAUST project website: http://faust-fp7.eu

An extended version of the interview broadcast on Click on BBC World Service Radio on 19th July 2011 is available on The Open University website.


June 2011 -- Cambridge Spanish-English interactive SMT systems are running on Reverso Labs

Real-time SMT systems based on Cambridge's HiFST decoder.
Try our systems and the other FAUST MT systems at http://labs.reverso.net


April 2011 -- Gonzalo Iglesias, EAMT Best Thesis Awardee 2010

From the EAMT website http://www.eamt.org/news/news_best_thesis_winner.php :

  • Dr. Gonzalo Iglesias has received a prize of €500 and has been granted a €200 bursary so that he can present a summary of his thesis at the Annual Conference of the EAMT (EAMT-2011) which will take place in Leuven, on May 30-31, 2011.


2011 Talks and Presentations

  • A. de Gispert. Hierarchical Phrase-Based Translation at University of Cambridge. Talk at Barcelona Media Innovation Centre, Barcelona, Catalonia (Spain), July 2011.
  • A. de Gispert. Hierarchical Phrase-Based Translation at University of Cambridge. Talk at Catalonia Research Group on Accessibility and Ambient Intelligence (CaiaC), Universitat Autònoma de Catalunya, Bellaterra, Catalonia (Spain), July 2011.
  • A. de Gispert. Hierarchical Phrase-Based Representations: Decoding with Push-Down Transducers and Entropy-Pruned Language Models. Talk at DARPA GALE PI Meeting, Arlington, VA (USA), May 2011.
  • A. de Gispert. Hierarchical Phrase-Based Translation at University of Cambridge. Talk at Google Research Labs, Mountain View, CA (USA), May 2011.
  • G. Blackwood. Minimum Bayes-Risk Lattice Rescoring Methods for Statistical Machine Translation. Natural Language Processing Seminar, Computer Lab, University of Cambridge. May 2011.
  • G. Blackwood. Lattice Rescoring Methods for Statistical Machine Translation. Talk at SRI, Menlo Park, CA (USA), April 2011.
  • G. Iglesias. Hierarchical Phrase-based Translation with Weighted Finite State Transducers. Invited Presentation and Best Thesis Award for 2010. 15th Annual Conference of the European Association for Machine Translation, Leuven, Belgium, May 2011.
  • G. Blackwood. Lattice Rescoring Methods for Statistical Machine Translation. Talk at IBM TJ Watson Research Labs, Yorktown Heights, NY (USA). February, 2011.


September 2010 -- Paper published in Computational Linguistics

Adrià de Gispert, Gonzalo Iglesias, Graeme Blackwood, Eduardo Banga, William Byrne. Hierarchical Phrase-based Translation with Weighted Finite State Transducers and Shallow-N Grammars. Computational Linguistics 36(3):505-533. September 2010.


2010 Talks and Presentations

  • William Byrne. Hierarchical phrase-based translation with weighted finite state transducers. Keynote speech at IWSLT 2010, Paris (France), December 2010.
  • William Byrne. Hierarchical phrase-based translation with weighted finite state transducers. Natural Language Processing Group, Department of Computer Science, University of Sheffield, UK, December 2010.
  • William Byrne. Recent research in statistical machine translation. Winton Capital Management Internal Research Conference, November 2010. Invited presentation.
  • William Byrne. Hierarchical phrase-based translation with weighted finite state transducers. Dublin Computational Linguistics Research Seminar, Dublin, Ireland, November 2010.
  • Matthew Gibson and William Byrne. EMIME project overview. European Commission Information Society Conference (ICT 2010), Brussels, Belgium, September 2010.
  • A. de Gispert. Hierarchical Phrase-Based Translation with weighted finite state transducers. Talk at IST / INESC-id, Lisbon (Portugal), July 2010.
  • William Byrne and Adrià de Gispert. Fast Hiero grammars. DARPA GALE PI Meeting, Scottsdale, AZ, USA, April 2010.
  • William Byrne. Hierarchical phrase-based translation with weighted finite state transducers. Columbia University, New York, NY, USA, April 2010.
  • William Byrne. Hierarchical phrase-based translation with weighted finite state transducers. Google, Inc, Mountain View, CA, USA, April 2010.
  • William Byrne. FAUST project overview. ICT-FP7 Language Technology Days, Luxembourg, March 2010.


2010 Conference Papers

Adrià de Gispert, Juan Pino, William Byrne. Hierarchical phrase-based translation grammars extracted from alignment posterior probabilities. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cambridge, MA, 2010.

Graeme Blackwood, Adrià de Gispert, William Byrne. Fluency Constraints for Minimum Bayes-Risk Decoding of Statistical Machine Translation Lattices. Proceedings of the International Conference on Computational Linguistics (COLING) 2010

Juan Pino, Gonzalo Iglesias, Adrià de Gispert, Graeme Blackwood, Jamie Brunning and William Byrne. The CUED HiFST System for the WMT10 Translation Shared Task. ACL 2010 Joint Fifth Workshop on Statistical Machine Translation

Graeme Blackwood, Adrià de Gispert, William Byrne. Efficient Path Counting Transducers for Minimum Bayes-Risk Decoding of Statistical Machine Translation Lattices. Proc. Annual Meeting of the Association for Computational Linguistics (ACL) 2010

Mikko Kurimo, et al. Personalising speech-to-speech translation in the EMIME project. Proc. Annual Meeting of the Association for Computational Linguistics (ACL) 2010 (Demo session)


5 April 2010

Gonzalo Iglesias starts as an RA on the FAUST project


2010 Google Research Award

Weighted Finite State Transducers in Hierarchical Phrase-Based Translation
PI: Bill Byrne
Google contact: Michael Riley


WMT 2010 Shared Translation Tasks

Excellent performance in translation between Spanish, French, and English


New FP7 research project on interactive statistical machine translation



NIST 2009 Open Machine Translation Evaluation -- Top-ranked Arabic-to-English SMT system

Our system placed first in both the Single System Track and the System Combination Track