The block world corpus
======================

The corpus originates from the Department of Linguistics and Philology at Uppsala University, Sweden, www.lingfil.uu.se

By the kind consent of Professor Jörg Tiedemann, I have permission to use the corpus, translate it into new languages and license it as I see fit.

The corpus illustrates some linguistic features that are a nuisance to machine translation. You can imagine the context as a kind of game with four players, two men and two women. The playing board has a lot of (sometimes) overlapping circles in different colours. Each player has some markers in the form of three-dimensional objects like blocks, cones and arrows. A marker can be put in any circle, and the shape of the blocks allows other objects to be put on top of them.

For more information on Swedish grammar, please see Swedish_grammar.txt

The corpus consists of two parts:

1. Files for training/developing a translation model - these files are named blockworld.parallel + language suffix, e.g. blockworld.parallel.en
2. Files for building a statistical language model - these files are named blockworld.full + language suffix, e.g. blockworld.full.en

(The files are UTF-8 encoded to comply with the standard for machine translation. The national characters will be distorted if you use Windows and open the files with e.g. Notepad. I recommend Notepad++ for viewing and editing the files.)

Originally the corpus was intended for experiments with statistical machine translation, but it can just as well be used with rule-based systems, e.g. shallow-transfer systems.

The original corpus was in English and Swedish, but I have added translations into some new languages. It would be useful to have the corpus translated into more languages. I would very much appreciate it if you translated the corpus into your language and sent the files to me.

N.B. My translations into Danish, French, German and Spanish have so far not been reviewed by any native speaker and might contain errors.

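As a rough illustration of the file naming and encoding described above, the following Python sketch loads one language pair from the parallel files. It is only a sketch under stated assumptions: the function name read_parallel is made up for this example, the suffix "sv" for Swedish is a guess (only "en" is shown above), and the code assumes the files are aligned line by line, which is common for parallel corpora but is not stated here.

```python
from pathlib import Path


def read_parallel(directory, src="en", tgt="sv", stem="blockworld.parallel"):
    """Return a list of (source, target) sentence pairs.

    Assumption: line i of the source file is the translation of line i
    of the target file. The "sv" suffix is likewise an assumption.
    """
    base = Path(directory)
    columns = []
    for suffix in (src, tgt):
        # Open explicitly as UTF-8 so national characters are not distorted,
        # whatever the default encoding of the platform is.
        with open(base / f"{stem}.{suffix}", encoding="utf-8") as handle:
            columns.append([line.rstrip("\n") for line in handle])
    if len(columns[0]) != len(columns[1]):
        raise ValueError("the two files have different numbers of lines")
    return list(zip(columns[0], columns[1]))
```

Calling read_parallel(".") in the directory that holds the corpus files would then give pairs that can be inspected or passed on to a training tool. The blockworld.full files can be read the same way, one file at a time, by changing the stem.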
Per Tunedal