Spoken Wikipedia

The Spoken Wikipedia project unites volunteer readers of Wikipedia articles. Hundreds of spoken articles in multiple languages are available to users who are – for one reason or another – unable or unwilling to consume the written version of the article. We turn this speech resource into a time-aligned corpus, making it accessible for research and fostering new ways of interacting with the material.

I've led the initiative to turn the Spoken Wikipedia into a well-formed corpus that is presently the largest collection of freely licensed German and Dutch speech (more than 400 hours per language). We've since used the corpus to analyze the prosody-syntax interface with more data than ever before.

Details and recent developments on the corpus are available at its dedicated webpage; the code can be found on Bitbucket.

The data is persistently available at doi:10.25592/uhhfdm.1875.


  • Arne Köhn (collaborator and co-lead)
  • Felix Hennig (research student)
  • Florian Stegen, Marcel Rohde, Oskar Dörfler (students who graduated with theses on SWC topics)
  • Siva Meenakshi Renganathan (intern)
  • Jonathan Werner, Paul Bienkowski, Florian Stegen, Tim Dobert (students in project on long audio alignment)

Related Publications:

  • Timo Baumann, Arne Köhn and Felix Hennig (2019).
    "The Spoken Wikipedia Corpus Collection: Harvesting, Alignment and an Application to Hyperlistening".
    Language Resources and Evaluation, 53(2):303–329. Special Issue representing significant contributions of LREC 2016.
    URN, DOI, pdf (preprint), webpage, bibtex
  • Arne Köhn, Timo Baumann and Oskar Dörfler (2018).
    "An Empirical Analysis of the Correlation of Syntax and Prosody".
    in: Proceedings of Interspeech. Hyderabad, India, pages 2157–2161.
    DOI, pdf, poster, bibtex
  • Timo Baumann (2018).
    "Learning to Determine Who is the Better Speaker".
    in: Proceedings of Speech Prosody. Poznań, Poland.
    DOI, pdf, bibtex
  • Timo Baumann (2017).
    "Large-scale Speaker Ranking from Crowdsourced Pairwise Listener Ratings".
    in: Proceedings of Interspeech. Stockholm, Sweden.
    DOI, pdf, poster, data, bibtex
  • Marcel Rohde and Timo Baumann (2016).
    "Navigating the Spoken Wikipedia".
    in: Proceedings of the Workshop on Spoken Language Processing for Assistive Technologies. San Francisco, USA.
    URN, pdf, bibtex
  • Arne Köhn, Florian Stegen and Timo Baumann (2016).
    "Mining the Spoken Wikipedia for Speech Data and Beyond".
    in: Proceedings of LREC. Portorož, Slovenia.
    URN, pdf, poster, webpage, bibtex

Related Outreach Activities:

  • "Die gesprochene Wikipedia hilft und profitiert von freier Sprachtechnologie", talk at WikiCon 2016, the annual meeting of the German-speaking Wikipedia community, Stuttgart, 2016-09-17. Participation funded by Wikimedia Deutschland e.V.


Page last modified on February 29, 2024, at 12:06 PM