Prepared by Vladimír Benko within the framework of a joint project of
Main design decisions
- Slovak-centric (languages spoken and/or taught in Slovakia and the neighbouring countries)
- Latin names denoting language and size
- Crawled by SpiderLing at (approximately) the same time
- Language-independent filtration by the same tools
- Language-dependent filtration by the same methodology, by open-source or free tools
- All tagsets mapped into Araneum Universal Tagset
- Deduplicated at document level: duplicate and near-duplicate documents deleted
- Deduplicated at paragraph and/or sentence level: duplicate and near-duplicate segments marked
- Word sketches with compatible sketch grammars
- Accessible online via web interface:
Engine) at ucts.uniba.sk,
at kontext.korpus.cz, and
(under Sketch Engine) at www.sketchengine.co.uk
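The two deduplication decisions in the list above (near-duplicate documents deleted, duplicate segments only marked) can be illustrated with a minimal sketch. This is not the tooling actually used for the Aranea corpora; it assumes a simple word-trigram Jaccard similarity for documents and exact matching of whitespace-normalized paragraphs for segments. The function names and the 0.8 threshold are hypothetical choices made for illustration only.

```python
from typing import List, Set, Tuple


def shingles(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams (shingles) of a lower-cased text."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Jaccard similarity of two shingle sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_duplicate_documents(docs: List[str], threshold: float = 0.8) -> List[str]:
    """Document level: DELETE documents too similar to one already kept.

    The threshold is an illustrative assumption, not a documented setting.
    """
    kept: List[str] = []
    kept_shingles: List[Set[Tuple[str, ...]]] = []
    for doc in docs:
        s = shingles(doc)
        if any(jaccard(s, k) >= threshold for k in kept_shingles):
            continue  # duplicate or near-duplicate document: removed
        kept.append(doc)
        kept_shingles.append(s)
    return kept


def mark_duplicate_paragraphs(docs: List[str]) -> List[List[Tuple[str, bool]]]:
    """Segment level: KEEP every paragraph but MARK repeats with a flag.

    Only exact repeats (after case and whitespace normalization) are
    detected here; near-duplicate segments would need a similarity
    measure such as the one used for documents above.
    """
    seen: Set[str] = set()
    result: List[List[Tuple[str, bool]]] = []
    for doc in docs:
        marked: List[Tuple[str, bool]] = []
        for para in doc.split("\n"):
            key = " ".join(para.lower().split())
            if not key:
                continue  # skip empty lines
            marked.append((para, key in seen))
            seen.add(key)
        result.append(marked)
    return result
```

The asymmetry is the point of the sketch: deleting at document level shrinks the corpus, while marking at segment level leaves the text intact and lets a query decide whether to ignore the flagged segments.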
Corpora available so far (December 2014)
If you use the Aranea corpora for research purposes, or need to mention them for any reason,
please cite the following paper(s):
- Benko, Vladimír: Aranea: Yet Another Family of (Comparable) Web Corpora.
In Petr Sojka, Aleš Horák, Ivan Kopeček and Karel Pala (Eds.):
Text, Speech and Dialogue. 17th International Conference,
TSD 2014, Brno, Czech Republic, September 8-12, 2014. Proceedings.
Springer International Publishing Switzerland, 2014. pp. 257-264.
ISBN: 978-3-319-10815-5 (Print), 978-3-319-10816-2 (Online).
- Benko, Vladimír: Compatible Sketch Grammars for Comparable Corpora.
In Andrea Abel, Chiara Vettori, Natascia Ralli (Eds.):
Proceedings of the XVI EURALEX International Congress: The User In Focus. 15–19 July 2014.
Bolzano/Bozen: Eurac Research, 2014. pp. 417-430. ISBN 978-88-88906-97-3.
As well as the paper on the NoSketch Engine:
- Rychlý, Pavel: Manatee/Bonito – A Modular Corpus Manager.
In 1st Workshop on Recent Advances in Slavonic Natural Language Processing.
Brno: Masaryk University, 2007, pp. 65-70. ISBN 978-80-210-4471-5.
Please send your comments and/or questions to vladimir.benko