Spoken Wikipedia

The Spoken Wikipedia project unites volunteer readers of Wikipedia articles. Hundreds of spoken articles in multiple languages are available to users who are – for one reason or another – unable or unwilling to consume the written version of the article. We turn this speech resource into a time-aligned corpus, making it accessible for research and fostering new ways of interacting with the material.

I've led the initiative to turn the Spoken Wikipedia into a well-formed corpus that is presently the largest collection of freely licensed German and Dutch speech (with >400h per language). We've since used the corpus to analyze the prosody-syntax interface with more data than ever before.

Details and recent developments are available on the corpus's dedicated webpage; the code can be found on Bitbucket.

The data is persistently available at doi:10.25592/uhhfdm.1875.

Collaborators:

Arne Köhn (collaborator and co-lead)
Felix Hennig (research student)
Florian Stegen, Marcel Rohde, Oskar Dörfler (students who graduated on SWC topics)
Siva Meenakshi Renganathan (intern)
Jonathan Werner, Paul Bienkowski, Florian Stegen, Tim Dobert (students in project on long audio alignment)

Related Publications:

Timo Baumann, Arne Köhn and Felix Hennig (2019).
"The Spoken Wikipedia Corpus Collection: Harvesting, Alignment and an Application to Hyperlistening".
Language Resources and Evaluation, 53(2):303–329. Special Issue representing significant contributions of LREC 2016.
URN, DOI, pdf (preprint), webpage, bibtex
Arne Köhn, Timo Baumann and Oskar Dörfler (2018).
"An Empirical Analysis of the Correlation of Syntax and Prosody".
in: Proceedings of Interspeech. Hyderabad, India, pages 2157–2161.
DOI, pdf, poster, bibtex
Timo Baumann (2018).
"Learning to Determine Who is the Better Speaker".
in: Proceedings of Speech Prosody. Poznań, Poland.
DOI, pdf, bibtex
Timo Baumann (2017).
"Large-scale Speaker Ranking from Crowdsourced Pairwise Listener Ratings".
in: Proceedings of Interspeech. Stockholm, Sweden.
DOI, pdf, poster, data, bibtex
Marcel Rohde and Timo Baumann (2016).
"Navigating the Spoken Wikipedia".
in: Proceedings of the Workshop on Spoken Language Processing for Assistive Technologies. San Francisco, USA.
URN, pdf, bibtex
Arne Köhn, Florian Stegen and Timo Baumann (2016).
"Mining the Spoken Wikipedia for Speech Data and Beyond".
in: Proceedings of LREC. Portoroz, Slovenia.
URN, pdf, poster, webpage, bibtex

Related Outreach Activities:

"Die gesprochene Wikipedia hilft und profitiert von freier Sprachtechnologie", Talk at WikiCon 2016, the annual Wikipedia meeting of the German-speaking Wikipedia, Stuttgart, 2016-09-17. Participation funded by Wikimedia Deutschland e.V.

Page last modified on February 29, 2024, at 12:06 PM