The Spoken Wikipedia project unites volunteer readers of Wikipedia articles. Hundreds of spoken articles in multiple languages are available to users who are – for one reason or another – unable or unwilling to consume the written version of the article. We turn this speech resource into a time-aligned corpus, making it accessible for research and fostering new ways of interacting with the material.
I've led the initiative to turn the Spoken Wikipedia into a well-formed corpus that is presently the largest collection of freely licensed German and Dutch speech (more than 400 hours per language). We've since used the corpus to analyze the prosody-syntax interface with more data than ever before.
All details and recent developments about the corpus are available on its dedicated webpage; the code can be found on Bitbucket.
Contributors:
- Arne Köhn (collaborator and co-lead)
- Felix Hennig (research student)
- Florian Stegen, Marcel Rohde, Oskar Dörfler (students who wrote their theses on SWC topics)
- Siva Meenakshi Renganathan (intern)
- Jonathan Werner, Paul Bienkowski, Florian Stegen, Tim Dobert (students in a project on long audio alignment)
Publications:
- Timo Baumann, Arne Köhn and Felix Hennig (2019).
"The Spoken Wikipedia Corpus Collection: Harvesting, Alignment and an Application to Hyperlistening".
Language Resources and Evaluation, 53(2):303–329. Special Issue representing significant contributions of LREC 2016.
URN, DOI, pdf (preprint), webpage, bibtex
- Arne Köhn, Timo Baumann and Oskar Dörfler (2018).
"An Empirical Analysis of the Correlation of Syntax and Prosody".
in: Proceedings of Interspeech. Hyderabad, India, pages 2157–2161.
DOI, pdf, poster, bibtex
- Timo Baumann (2018).
"Learning to Determine Who is the Better Speaker".
in: Proceedings of Speech Prosody. Poznań, Poland.
DOI, pdf, bibtex
- Timo Baumann (2017).
"Large-scale Speaker Ranking from Crowdsourced Pairwise Listener Ratings".
in: Proceedings of Interspeech. Stockholm, Sweden.
DOI, pdf, poster, data, bibtex
- Marcel Rohde and Timo Baumann (2016).
"Navigating the Spoken Wikipedia".
in: Proceedings of the Workshop on Spoken Language Processing for Assistive Technologies. San Francisco, USA.
URN, pdf, bibtex
- Arne Köhn, Florian Stegen and Timo Baumann (2016).
"Mining the Spoken Wikipedia for Speech Data and Beyond".
in: Proceedings of LREC. Portorož, Slovenia.
URN, pdf, poster, webpage, bibtex
Related Outreach Activities:
- "Die gesprochene Wikipedia hilft und profitiert von freier Sprachtechnologie", Talk at WikiCon 2016, the annual Wikipedia meeting of the German-speaking Wikipedia, Stuttgart, 2016-09-17. Participation funded by Wikimedia Deutschland e.V.