Treebank of Learner English

The Treebank of Learner English (TLE) is a collection of 5,124 English as a Second Language (ESL) sentences (97,681 words), manually annotated with part-of-speech (POS) tags and dependency trees in the Universal Dependencies (UD) formalism. The dataset represents adult learners of English at the upper-intermediate level from 10 native language backgrounds, with over 500 sentences for each native language. The sentences were drawn from the First Certificate in English (FCE) portion of the Cambridge Learner Corpus.


The dataset and the annotation guidelines were developed in a collaborative project between MIT CSAIL, MIT Linguistics and the Center for Brains, Minds and Machines (CBMM). Detailed information about the annotations is available in the paper "Universal Dependencies for Learner English", cited below. A draft of the manual used by the annotators is available here. Please contact Yevgeni Berzak (berzak at mit dot edu) if you spot any errors or inconsistencies in the treebank, or if you have questions about the ESL annotation guidelines.


Citation
You are encouraged to cite the following papers when using the dataset:

  • Yevgeni Berzak, Jessica Kenney, Carolyn Spadine, Jing Xian Wang, Lucia Lam, Keiko Sophie Mori, Sebastian Garza and Boris Katz (2016) "Universal Dependencies for Learner English", Annual Meeting of the Association for Computational Linguistics (ACL).
  • Helen Yannakoudakis, Ted Briscoe and Ben Medlock (2011) "A New Dataset and Method for Automatically Grading ESOL Texts", Annual Meeting of the Association for Computational Linguistics (ACL), pages 180-189.


Download
The TLE is publicly available through the UD repository (English-ESL).
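
UD treebanks are distributed in the CoNLL-U format: one token per line with ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), comment lines starting with "#", and a blank line between sentences. The following is a minimal sketch of reading such a file in Python, not an official loader. The function name read_conllu and the sample sentence are invented for illustration; the sample is not taken from the TLE, and the exact contents of the released files should be checked against the repository's own documentation.

```python
from collections import Counter
from typing import Dict, Iterator, List

# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts keyed by column name."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            continue  # ignore malformed lines in this sketch
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        yield sentence


# Hypothetical sample sentence, built with explicit tabs; NOT from the TLE.
SAMPLE = "\n".join([
    "# sent_id = hypothetical-1",
    "\t".join(["1", "She", "she", "PRON", "PRP", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "go", "go", "VERB", "VBP", "_", "0", "root", "_", "_"]),
    "\t".join(["3", "home", "home", "ADV", "RB", "_", "2", "advmod", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", ".", "_", "2", "punct", "_", "_"]),
    "",
])

sentences = list(read_conllu(SAMPLE))
upos_counts = Counter(tok["upos"] for sent in sentences for tok in sent)
roots = [tok["form"] for sent in sentences for tok in sent if tok["head"] == "0"]
print(len(sentences), dict(upos_counts), roots)
```

To apply the same sketch to a downloaded file, read it with open(path, encoding="utf-8").read() and pass the resulting string to read_conllu. HEAD values are kept as strings here; convert them with int() if you need to index into the token list.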


People
Annotation and guidelines: Yevgeni Berzak, Jessica Kenney, Carolyn Spadine, Jing Xian Wang, Lucia Lam, Keiko Sophie Mori, Sebastian Garza, Boris Katz.
Web search: Emily Kellison-Lynn, Yevgeni Berzak, Andrei Barbu, Jessica Kenney.


Team Annotations: Yevgeni, Lucia, Sophie, Jessica, Jing, Sebastian.