FAUST - Feedback Analysis for User adaptive Statistical Translation

Data sets created for use by the FAUST project

User feedback corpus

This release comprises entries drawn from the weblogs at Reverso.net. It contains 6,346 log entries in which users suggested a better translation, around 10% of the total feedback collected in the FAUST language pairs during the 3 years of the project. Please contact Theo Hoffenberg (theo -at- softissimo -dot- com) if you are interested in obtaining a larger portion of the collection.

Language direction	Sample	Total feedback collected during FAUST	% sample
English->French (EN-FR)	2490	27763	9%
French->English (FR-EN)	1596	16894	9%
English->Spanish (EN-SP)	854	6668	13%
Spanish->English (SP-EN)	501	3866	13%
English->Czech (EN-CZ)	802	6154	13%
Czech->English (CZ-EN)	47	313	15%
English->Romanian (EN-RO)	44	44	100%
Romanian->English (RO-EN)	12	12	100%
Total	6346	61714	10%
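
The "% sample" column is the sample size divided by the total feedback collected, rounded to the nearest whole percent. A minimal Python sketch of that arithmetic, using three rows copied from the table above (the function name sample_percent is illustrative and not part of the FAUST release):

```python
def sample_percent(sample: int, total: int) -> int:
    """Return the sample size as a whole-number percentage of the total."""
    return round(100 * sample / total)


# (sample, total feedback collected) pairs copied from the table.
rows = {
    "EN-FR": (2490, 27763),
    "CZ-EN": (47, 313),
    "Total": (6346, 61714),
}

for pair, (sample, total) in rows.items():
    print(f"{pair}: {sample_percent(sample, total)}%")
```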

ftp://mi.eng.cam.ac.uk/data/faust//FaustFeedbackSample.xls.gz

Test sets

This package contains raw test sets crawled from web material, cleaner versions of them, and their respective reference translations for 9 European language pairs.

ftp://mi.eng.cam.ac.uk/data/faust/FAUST-1.0.tgz

Translation Feedback

(1) Analysis and annotation of a corpus of open-domain, real-world automatic translations

The quality assessments provide relative rankings and absolute (satisfactory/unsatisfactory) adequacy assessments for ca. 12,000 translations generated from 2,000 English translation requests submitted to Softissimo's translation portal, http://reverso.net. These two layers of annotation are complementary, and they can be exploited to learn quality models for different applications, e.g., selecting among alternative translations or discarding unsatisfactory outputs. A professional translator corrected the most obvious typos in the input sentences and provided Spanish reference translations for all of them. The corrected sentences were then automatically translated into Spanish with five different systems.

ftp://mi.eng.cam.ac.uk/data/faust//UPC-Oct2011-FAUST-quality-assessments.tgz

(2) FFF and FFF+ corpora with annotations on the correctness of human feedback post-edits

The Faust Feedback Filtering corpora (FFF and FFF+) consist of quadruples manually annotated with binary assessments of the usefulness of the human post-editing feedback (i.e., whether the human feedback represents a better translation than the automatic translation). All the instances come from Reverso.net's weblogs. Their main purpose is to enable the learning of feedback filters that automatically identify the useful instances to be incorporated into an SMT engine. FFF is the original version from October 2011, with data for the en-fr and en-es language pairs. FFF+ is an enlarged and re-annotated version of the en-es part of FFF. Both corpora can be downloaded with the ftp links below:

ftp://mi.eng.cam.ac.uk/data/faust//LW-UPC-Oct2011-FAUST-feedback-annotation.tgz (FFF)
ftp://mi.eng.cam.ac.uk/data/faust//UPC-Mar2013-FAUST-feedback-annotation.tgz (FFF+)

Preliminary release of user feedback corpus.

Note that this release comprises entries drawn from the weblogs at Reverso.net. The FAUST project and the project participants are not responsible for its content. This release contains entries filtered so that the source text (i.e. the original translation request) is no more than twenty words in length; no other processing was done to this data. The distribution is in 3 parts:

FAUST_FBOct2010v0.1.part1.rar: User Feedback Release v0.1 Oct 2010 (5.1MB)

FAUST_FBOct2010v0.1.part2.rar: User Feedback Release v0.1 Oct 2010 (5.1MB)

FAUST_FBOct2010v0.1.part3.rar: User Feedback Release v0.1 Oct 2010 (1.2MB)
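
The length filter described for this release (source text of no more than twenty words) can be sketched as follows. This is only an illustration: it assumes that words are counted by splitting on whitespace, which the release note does not specify, and the name keep_entry is hypothetical.

```python
MAX_SOURCE_WORDS = 20  # limit stated in the release note


def keep_entry(source_text: str, max_words: int = MAX_SOURCE_WORDS) -> bool:
    """Return True if the source text has at most max_words tokens.

    Assumption: a "word" is any whitespace-separated token; the tokenization
    actually used for the release is not documented here.
    """
    return len(source_text.split()) <= max_words


requests = [
    "How do I get to the train station?",
    " ".join(["word"] * 21),  # 21 tokens, one over the limit
]
kept = [r for r in requests if keep_entry(r)]
print(len(kept))
```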

Linguistically annotated parallel corpora

Static Monolingual and Parallel Corpora for Catalan, Czech, English, French, Spanish and Romanian

Collections are described in this document: FAUSTD4.2.pdf

Language	Corpora	File
Catalan	el_periodico	ftp://mi.eng.cam.ac.uk/data/faust//el_periodico_ca-es.ca.conll.gz
Czech	news.shuffled	ftp://mi.eng.cam.ac.uk/data/faust//news.shuffled.cz.conll.gz
Czech	CzEng 0.9	http://ufal.mff.cuni.cz/czeng/czeng09/
Czech	CzEng 1.0	http://ufal.mff.cuni.cz/czeng/czeng10/
Czech	news-commentary10	ftp://mi.eng.cam.ac.uk/data/faust//news-commentary10.cz.conll.gz
Czech	news-commentary10_cz-en	ftp://mi.eng.cam.ac.uk/data/faust//news-commentary10_cz-en.cz.conll.gz
English	news.shuffled	ftp://mi.eng.cam.ac.uk/data/faust//news.shuffled.en.conll.part1.rar
		ftp://mi.eng.cam.ac.uk/data/faust//news.shuffled.en.conll.part2.rar
		ftp://mi.eng.cam.ac.uk/data/faust//news.shuffled.en.conll.part3.rar
		ftp://mi.eng.cam.ac.uk/data/faust//news.shuffled.en.conll.part4.rar
		ftp://mi.eng.cam.ac.uk/data/faust//news.shuffled.en.conll.part5.rar
		ftp://mi.eng.cam.ac.uk/data/faust//news.shuffled.en.conll.part6.rar
		ftp://mi.eng.cam.ac.uk/data/faust//news.shuffled.en.conll.part7.rar
		ftp://mi.eng.cam.ac.uk/data/faust//news.shuffled.en.conll.part8.rar
English	CzEng 0.9	http://ufal.mff.cuni.cz/czeng/czeng09/
English	CzEng 1.0	http://ufal.mff.cuni.cz/czeng/czeng10/
English	united_nations_es-en	ftp://mi.eng.cam.ac.uk/data/faust//united_nations_es-en.en.conll.gz
English	europarl-v5	ftp://mi.eng.cam.ac.uk/data/faust//europarl-v5.en.conll.gz
English	europarl-v5_es-en	ftp://mi.eng.cam.ac.uk/data/faust//europarl-v5_es-en.en.conll.gz
English	europarl-v5_fr-en	ftp://mi.eng.cam.ac.uk/data/faust//europarl-v5_fr-en.en.conll.gz
English	europarl-v6_ro-en	ftp://mi.eng.cam.ac.uk/data/faust//europarl-v6_ro-en.en.conll.gz
English	news-commentary10	ftp://mi.eng.cam.ac.uk/data/faust//news-commentary10.en.conll.gz
English	news-commentary10_cz-en	ftp://mi.eng.cam.ac.uk/data/faust//news-commentary10_cz-en.conll.gz
English	news-commentary10_es-en	ftp://mi.eng.cam.ac.uk/data/faust//news-commentary10_es-en.conll.gz
English	news-commentary10_fr-en	ftp://mi.eng.cam.ac.uk/data/faust//news-commentary10_fr-en.conll.gz
English	wmt10.select_es-en	ftp://mi.eng.cam.ac.uk/data/faust//wmt10.select_es-en.en.conll.gz
French	news.shuffled	ftp://mi.eng.cam.ac.uk/data/faust//news.shuffled.fr.conll.part1.rar
		ftp://mi.eng.cam.ac.uk/data/faust//news.shuffled.fr.conll.part2.rar
		ftp://mi.eng.cam.ac.uk/data/faust//news.shuffled.fr.conll.part3.rar
French	europarl-v5	ftp://mi.eng.cam.ac.uk/data/faust//europarl-v5.fr.conll.gz
French	europarl-v5_fr-en	ftp://mi.eng.cam.ac.uk/data/faust//europarl-v5_fr-en.fr.conll.gz
French	news-commentary10	ftp://mi.eng.cam.ac.uk/data/faust//news-commentary10.fr.conll.gz
French	news-commentary10_fr-en	ftp://mi.eng.cam.ac.uk/data/faust//news-commentary10_fr-en.fr.conll.gz
Romanian	europarl-v6_ro-en	ftp://mi.eng.cam.ac.uk/data/faust//europarl-v6_ro-en.ro.conll.gz
Spanish	united_nations_es-en	ftp://mi.eng.cam.ac.uk/data/faust//united_nations_es-en.es.conll.gz
Spanish	news.shuffled	ftp://mi.eng.cam.ac.uk/data/faust//news.shuffled.es.conll.gz
Spanish	el_periodico	ftp://mi.eng.cam.ac.uk/data/faust//el_periodico_ca-es.es.conll.gz
Spanish	europarl-v5	ftp://mi.eng.cam.ac.uk/data/faust//europarl-v5.es.conll.gz
Spanish	europarl-v5_es-en	ftp://mi.eng.cam.ac.uk/data/faust//europarl-v5_es-en.es.conll.gz
Spanish	news-commentary10	ftp://mi.eng.cam.ac.uk/data/faust//news-commentary10.es.conll.gz
Spanish	news-commentary10_es-en	ftp://mi.eng.cam.ac.uk/data/faust//news-commentary10_es-en.es.conll.gz
Spanish	wmt10.select_es-en	ftp://mi.eng.cam.ac.uk/data/faust//wmt10.select_es-en.es.conll.gz

Manual annotation of Czech and English Translation Dev/Test Sets

One of the tasks is to develop robust syntactic parsers able to parse the output of machine translation systems, which is often very “noisy” and contains many grammatical, lexical or word-order mistakes. In order to tune such robust parsers, the target side of a part of the Faust Dev/Test sets was manually annotated on the level of deep syntax. We did not annotate the MT outputs directly, because they are not stable and depend strongly on the translation engine. For this reason, we decided to annotate the reference translations manually. The correct annotation of the MT output can then be projected from the reference translations. The following package contains 3000 manually annotated Czech segments (reference translations from English) and 2000 English segments (reference translations from Czech).

tran-0.5.tar.gz

Conversion of CzEng parallel corpus into CoNLL format

CzEng, the Czech-English parallel corpus, has been automatically analyzed in its version 0.9 on the levels of morphology, syntax and deep syntax. It consists of approximately 8 million sentence pairs (93MW of English and 82MW of Czech). You can download it (after filling in the registration form) from http://ufal.mff.cuni.cz/czeng/czeng09/. Its "export_format" can be easily converted into CoNLL-2009 format using the following perl script.

czeng_to_conll.pl
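
For readers who want to consume the converted data, the following is a minimal Python sketch of a CoNLL-2009 reader. It assumes the standard CoNLL-2009 shared task layout (one token per line, tab-separated columns, blank line between sentences, 14 fixed columns followed by one argument column per predicate); whether the output of czeng_to_conll.pl matches this layout exactly has not been checked here, and the function name read_conll09 is illustrative.

```python
from typing import Dict, Iterable, Iterator, List

# Fixed columns of the CoNLL-2009 shared task format; any further columns
# are per-predicate argument labels (APRED1, APRED2, ...).
CONLL09_COLUMNS = [
    "ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS", "FEAT", "PFEAT",
    "HEAD", "PHEAD", "DEPREL", "PDEPREL", "FILLPRED", "PRED",
]


def read_conll09(lines: Iterable[str]) -> Iterator[List[Dict[str, object]]]:
    """Yield sentences as lists of token dictionaries."""
    sentence: List[Dict[str, object]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        token: Dict[str, object] = dict(zip(CONLL09_COLUMNS, fields))
        token["APREDS"] = fields[len(CONLL09_COLUMNS):]
        sentence.append(token)
    if sentence:
        yield sentence


# Tiny hand-made sample (not taken from CzEng).
sample = "\n".join([
    "\t".join(["1", "Hello", "hello", "hello", "UH", "UH", "_", "_",
               "0", "0", "ROOT", "ROOT", "_", "_"]),
    "",
    "\t".join(["1", "Bye", "bye", "bye", "UH", "UH", "_", "_",
               "0", "0", "ROOT", "ROOT", "_", "_"]),
])
sentences = list(read_conll09(sample.splitlines()))
print(len(sentences), sentences[0][0]["FORM"])
```

To read a gzip-compressed file, the same function can be fed from gzip.open(path, "rt", encoding="utf-8") in the Python standard library; the UTF-8 encoding is an assumption about the files.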
