Corpus research on the translation of English multi-word discourse markers into Hebrew and Lithuanian

Giedre Valunaite Oleskeviciene
Chaya Liebeskind

Abstract. Creating multilingual resources is necessary both for research across multiple languages and for making such resources available within the Linked Data paradigm. The present research has created an extensive parallel data resource focused on multi-word expressions used as discourse markers. Multi-word expressions are important in language generation and processing; they often contribute to discourse organization and function as discourse markers. To study how they change in translation, we combine the alignment model of phrase-based statistical machine translation with manual processing of the data to examine English multi-word discourse markers and their equivalents in Lithuanian and Hebrew translation. After compiling a list of multi-word discourse markers, we checked the number of discourse markers present in our generated parallel corpus and focused on the median group of multi-word discourse markers. Our research shows that the examined multi-word discourse markers tend to remain multi-word lexemes in Hebrew translation, whereas in Lithuanian, because translation tends to rely on inflection, they become one-word discourse markers. In addition, context may influence the translation, leading to the integration of a particle or connective into the Lithuanian or Hebrew discourse marker, which adds contextual pragmatic meaning.
