Algorithm for Semantic Network Generation from Texts of Low Resource Languages Such as Kiswahili

Wamkaya Wanjawa, Barack; Muchemi, Lawrence; Miriti, Evans

Center for Open Access in Science (COAS)
OPEN JOURNAL FOR INFORMATION TECHNOLOGY (OJIT)
ISSN (Online) 2620-0627 * ojit@centerprode.com

2024 - Volume 7 - Number 2

Barack Wamkaya Wanjawa * ORCID: 0000-0003-0198-3179
University of Nairobi, Department of Computer Science, Nairobi, KENYA

Lawrence Muchemi * ORCID: 0000-0001-5911-5679
University of Nairobi, Department of Computer Science, Nairobi, KENYA

Evans Miriti * ORCID: 0000-0002-6949-7700
University of Nairobi, Department of Computer Science, Nairobi, KENYA

Open Journal for Information Technology, 2024, 7(2), 55-70 * https://doi.org/10.32591/coas.ojit.0702.01055w
Received: 23 February 2024 ▪ Revised: 19 November 2024 ▪ Accepted: 13 December 2024

LICENCE: Creative Commons Attribution 4.0 International License.

ABSTRACT:
Processing low-resource languages such as Kiswahili with machine learning is difficult because adequate training data is lacking. Such languages are nevertheless important for human communication and are in daily use, and their users need practical machine processing tasks such as summarization, disambiguation and question answering (QA). One method of processing such languages while bypassing the need for training data is the use of semantic networks. Some low-resource languages, such as Kiswahili, have a subject-verb-object (SVO) structure, and a semantic network is likewise built from subject-predicate-object triples, so SVO part-of-speech tags can map onto a semantic network triple. An algorithm that processes raw natural language text and maps it into a semantic network is therefore necessary and desirable for structuring low-resource language texts. The algorithm was tested on the Kiswahili QA task and achieved up to 78.6% exact match.
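
The mapping from SVO part-of-speech tags to a subject-predicate-object triple can be illustrated with a minimal sketch. This is not the authors' algorithm: the tag names (`N`, `PROPN`, `V`), the function `svo_to_triple`, and the example sentence are assumptions made purely for illustration, and a real Kiswahili tagger would emit its own tag set.

```python
from typing import List, Optional, Tuple

# Hypothetical coarse POS tags, assumed for illustration only;
# an actual Kiswahili tagger would use its own tag set.
NOUN_TAGS = {"N", "PROPN"}
VERB_TAGS = {"V"}

Tagged = Tuple[str, str]        # (word, POS tag)
Triple = Tuple[str, str, str]   # (subject, predicate, object)


def svo_to_triple(tagged: List[Tagged]) -> Optional[Triple]:
    """Map one POS-tagged SVO sentence to a triple, or None if incomplete."""
    subject = predicate = obj = None
    for word, tag in tagged:
        if tag in NOUN_TAGS and subject is None and predicate is None:
            subject = word      # first noun before any verb
        elif tag in VERB_TAGS and subject is not None and predicate is None:
            predicate = word    # first verb after the subject
        elif tag in NOUN_TAGS and predicate is not None and obj is None:
            obj = word          # first noun after the verb
    if subject and predicate and obj:
        return (subject, predicate, obj)
    return None


# "Juma anasoma kitabu" (Juma is reading a book), tagged by hand.
sentence = [("Juma", "PROPN"), ("anasoma", "V"), ("kitabu", "N")]
print(svo_to_triple(sentence))  # ('Juma', 'anasoma', 'kitabu')
```

The sketch handles only a single simple clause; sentences with modifiers, several clauses, or a missing object would need further rules.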

KEY WORDS: algorithm, low resource language, question answering, semantic networks, Kiswahili.

CORRESPONDING AUTHOR:
Barack Wamkaya Wanjawa, University of Nairobi, Department of Computer Science, Nairobi, KENYA.
