A Literature Review on Automatic Generation of Examinations

The examination is a key activity in determining what the learner has gained from study. Institutions of higher learning (IHL) perform this activity through various assessment methods (test/examination, practical, etc.). Automation of exam generation has become a pressing need during the COVID-19 pandemic, which has greatly affected education and accelerated the adoption of online learning and examination. A test/exam comprises questions and answers that evaluate how conversant the student is in the area of study. Each question has a cognitive level, as described by Armstrong (2016) in the revised Bloom's taxonomy, and questions are chosen with cognitive levels appropriate to the level of study and to the standardization of the exam. There is, therefore, a need to consider the question's cognitive level along with other factors when generating an examination by incorporating deep learning.


Introduction
Over time, there has been a notable increase in the number of students joining tertiary institutions. To manage this increase, institutions have responded by creating flexible learning patterns, including introducing e-learning and embracing technology to digitize the majority of the work involved. Automation has been adopted to enhance efficiency and effectiveness within a reduced timeframe. Flexible learning patterns call for flexible examination patterns, so examiners are challenged to rethink their approach to cater to this need. The solution is to semi- or fully automate the examination process to minimize human intervention and increase efficiency and effectiveness.
Developing examinations remains a challenge because questions and answers are not readily generated. Researchers have engaged in automatic question generation, with the majority focusing on multiple-choice questions and "wh" questions, as demonstrated by Ali et al. (2018). Question cognitive level, weight, and topic coverage are key factors to consider when setting exams. Most researchers have focused on vocabulary assessment and understanding, while few studies check question complexity against the complete spectrum of Bloom's taxonomy. Little has been done on the use of Bloom's taxonomy in exam generation.

Methodology
This study is constrained to the classification of questions for the purpose of examination generation.

Standard examination
There are many types of assessment, or "testing", to assess students' learning; however, the written examination is the most common approach used by higher education institutions for student assessment (Omar et al., 2012). An exam contains questions, and some studies have sought to classify these questions based on Bloom's taxonomy (Abduljabbar & Omar, 2015). In recent times, various researchers have attempted this classification using diverse methods: the cognitive level of a question has been determined using techniques such as machine learning and rule-based approaches, among others.
IHL offer studies at one or more levels, ranging from hands-on skills to professional programs. A question can be set at various levels of study to test learning. There is, therefore, a need to analyze exam questions to fulfill the requirements of different levels of education, such as the Bachelor's or Master's level (Mohammedid & Omar, 2020).
Content validity, scorer reliability, discrimination, and objectivity are the four principles identified by Johnson (2001) that constitute a standard examination. Content validity is representative coverage of the whole course. Scorer reliability dictates that if a script is marked by two different examiners, they should arrive at the same score, i.e. there should be no statistically significant difference in the scores. Discrimination means the test should distinguish high achievers from weak students. Objectivity prescribes that the test be fair to all, irrespective of age, gender, religion, or any other natural distinction. Examiners should ensure that tests given to students at the same level or in the same class examine similar concepts and are sensitive to the questions' cognitive levels, thereby enforcing the objectivity requirement.

Exam questions
Questions bear different difficulty levels (Krathwohl, 2002). These levels build in increasing order from basic rote memorization to higher (more difficult and sophisticated) levels of critical thinking. Question cognitive levels must therefore be taken into consideration during examination generation to facilitate standardization. Failure to do so may lead to an imbalanced test, i.e., one containing so many sophisticated questions that it is too hard for the students, or vice versa.

A revision of Bloom's taxonomy
Bloom's taxonomy, as revised by Krathwohl (2002), breaks the cognitive domain into six levels: (a) Remember - recalling what has been learned.
(b) Understand - the ability to interpret and comprehend material well enough to state the problem in one's own words.
(c) Apply - the ability to use a concept to solve a new problem.
(d) Analyze - critically breaking information down into parts, guided by motives or causes, and developing inferences that support generalizations.
(e) Evaluate - the ability to justify and defend an opinion by making judgements about information, the validity of ideas, or the quality of work against a set of criteria.
(f) Create - the ability to produce something new by putting information together in a novel way or proposing alternative methods.

Setting examination
This is the process of preparing the questions used to assess the concepts taught (Ogula et al., 2006). Ogula is of the view that all the processes of setting exams should be carried out internally. These processes are exam setting, moderation, vetting by the external examiner, printing, and proofreading. All of them consume valuable time and, if mishandled, may expose exams to leakage.
A quality exam should factor in the six cognitive domains of Bloom's taxonomy (Bloom, 1994): knowledge, comprehension, application, analysis, synthesis, and evaluation. An exam consists of two sections: the questions and the answers. Questions have properties such as mark(s), topic, and complexity (cognitive level). Marks are assigned to each question and determine its weight in the exam; this assignment is influenced by the level of study and the question's complexity. A question examines an area of study (topic), and its complexity indicates its cognitive level. A question can be classified as very simple, simple, moderate, hard, or very hard. Each question should aim to test a certain cognitive level as described in the revised Bloom's taxonomy.
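The question properties described above can be captured in a small data model. This is a minimal sketch under assumed names (the `Question` class and its fields are illustrative, not taken from any of the cited systems); it also shows how an exam's balance across cognitive levels could be inspected.

```python
from dataclasses import dataclass

# Cognitive levels from the revised Bloom's taxonomy.
BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]

@dataclass
class Question:
    text: str
    topic: str             # area of study the question examines
    marks: int             # weight of the question in the exam
    cognitive_level: str   # one of BLOOM_LEVELS

def exam_weight_by_level(questions):
    """Total marks per cognitive level, to inspect an exam's balance."""
    totals = {level: 0 for level in BLOOM_LEVELS}
    for q in questions:
        totals[q.cognitive_level] += q.marks
    return totals

exam = [
    Question("Define normalization.", "databases", 2, "remember"),
    Question("Apply BCNF to the schema below.", "databases", 8, "apply"),
    Question("Critique the given ER design.", "databases", 10, "evaluate"),
]
print(exam_weight_by_level(exam))
```

A generation system could use such a summary to reject candidate exams whose marks are concentrated in the lowest levels.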

Automation techniques in questions classification
The issue of classifying exam questions based on Bloom's taxonomy has received considerable attention in recent years, with researchers using different techniques and features to handle the task (Omar et al., 2012). In this study, machine learning algorithms (MLA) and natural language processing (NLP) are used. The machine learning algorithms used are K-Nearest Neighbors (KNN), Logistic Regression (LR), and Support Vector Machine (SVM). The study aimed to combine two features: word2vec and TFPOS-IDF (W2VTFPOS-IDF).
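The TFPOS-IDF feature can be pictured as TF-IDF in which terms with certain part-of-speech tags receive a larger weight. The sketch below is an assumption-laden simplification: a tiny hand-made verb list stands in for a real POS tagger, the word2vec component is omitted, and all names are illustrative.

```python
import math
from collections import Counter

# Assumption: action verbs are the POS category boosted; a real system
# would run a POS tagger instead of consulting this toy verb list.
VERB_BOOST = 2.0
VERBS = {"define", "explain", "apply", "analyze", "evaluate", "design"}

def tfpos_idf(docs):
    """TF-IDF vectors with a POS-style boost on verb terms (sketch)."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d.lower().split()))
    vectors = []
    for d in docs:
        tf = Counter(d.lower().split())
        vec = {}
        for w, count in tf.items():
            idf = math.log(n / df[w]) + 1.0
            pos_weight = VERB_BOOST if w in VERBS else 1.0
            vec[w] = count * idf * pos_weight
        vectors.append(vec)
    return vectors

questions = ["Define a stack.", "Design a parser.", "Explain recursion."]
vecs = tfpos_idf(questions)
```

The resulting vectors would then be fed to KNN, LR, or SVM classifiers; the boost makes the action verb, which carries most of the cognitive-level signal, dominate the representation.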
Verbs and actions have been used to demonstrate different levels of learning (Diab & Sartawi, 2017). The solution was based on classifying the action verb of a question or learning outcome statement (LOS) in order to classify the whole question or LOS into a more accurate level. An action-verb classification algorithm was applied to the verb lists from questions and LOS to compute the maximum similarity for every level of the cognitive domain, following a rule-based approach. The study concluded that the approach can provide more accurate verbs and, in turn, more accurately identify the intended mental skills.
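The rule-based idea can be sketched as follows. The per-level verb lists here are small illustrative samples, not the actual lists used by Diab and Sartawi, and maximum similarity is reduced to simple word overlap.

```python
# Illustrative verb lists per cognitive level (samples only).
LEVEL_VERBS = {
    "remember":   {"define", "list", "state", "name"},
    "understand": {"explain", "summarize", "describe"},
    "apply":      {"apply", "solve", "use", "implement"},
    "analyze":    {"analyze", "compare", "differentiate"},
    "evaluate":   {"evaluate", "justify", "critique"},
    "create":     {"design", "construct", "propose"},
}

def classify_question(question):
    """Return the cognitive level whose verb list overlaps most with the
    question's words (rule-based; returns None if no verb matches)."""
    words = set(question.lower().replace("?", "").replace(".", "").split())
    best_level, best_score = None, 0
    for level, verbs in LEVEL_VERBS.items():
        score = len(words & verbs)
        if score > best_score:
            best_level, best_score = level, score
    return best_level

print(classify_question("Compare merge sort and quicksort."))
```

Such rules are transparent and cheap, which is why several of the studies reviewed here start from them before recommending machine learning for the harder, verb-ambiguous cases.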
A document analysis method was used by Karamustafaoglu et al. (2011). The research noted that teachers were asking many questions at the first three levels of Bloom's taxonomy. The study indicated that most teachers fear that students may not pass the test and therefore resort to setting questions at the low cognitive levels. It concludes by recommending attention to questions at the higher cognitive levels to facilitate critical thinking, since surface learning is encouraged by assessment strategies that reward low-level outcomes (Buick, 2011).
A comparative study of SVM and KNN was done by Patil & Shreyas (2018) in an attempt to achieve better performance and higher quality; grammar and context checks were applied, and the classification was used to test the student's level and the skills gained against Bloom's taxonomy cognitive levels. The Support Vector Machine (SVM) algorithm was used by Yahya et al. (2012). The classification process was divided into three steps: text representation, SVM classifier construction, and SVM classifier evaluation. The technique was evaluated by varying the frequency of stop words. The research observed that increasing the number of words used to represent the question lowers the performance of SVM. It concluded that the number of stop words should be more than one for good performance, and that reducing the number of stop words does not significantly improve performance.

Automation techniques in exam generation
Computer technology is changing rapidly, contributing to the development of new ideas and algorithms. A computer system can be made to simulate the process of generating exams. Such a system needs to coherently accommodate the items discussed above to successfully examine learning: cognitive level, topic, and weight/mark(s).
Despite the need to automate the process of exam generation in institutions, the success of the system must always fulfill certain parameters. Approach, tools, and algorithms used in the development phase play a significant role in fulfilling the addressed need. The quality of E-Systems is determined by views and usages (Nabil et al., 2011).
The question bank is the storage area for the questions fed into the system. Filtering criteria may be adopted, including the exam paper generation process, exclusion/inclusion of past semesters, the total number of items per paper, item complexity, the maximum items per topic, paper topic settings, test paper generation, and item analysis, as described by Yusof et al. (2017).
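Several of these filtering criteria can be illustrated with an in-memory question bank. The field names below are assumptions for the sketch, not the schema used by Yusof et al.

```python
def filter_bank(bank, exclude_past=True, max_per_topic=2,
                allowed_complexity=("moderate", "hard")):
    """Apply simple filtering criteria to a question bank (sketch):
    drop past-semester items, restrict complexity, cap items per topic."""
    selected, per_topic = [], {}
    for q in bank:
        if exclude_past and q["used_last_semester"]:
            continue
        if q["complexity"] not in allowed_complexity:
            continue
        if per_topic.get(q["topic"], 0) >= max_per_topic:
            continue
        per_topic[q["topic"]] = per_topic.get(q["topic"], 0) + 1
        selected.append(q)
    return selected

bank = [
    {"id": 1, "topic": "sql", "complexity": "moderate", "used_last_semester": False},
    {"id": 2, "topic": "sql", "complexity": "simple",   "used_last_semester": False},
    {"id": 3, "topic": "sql", "complexity": "hard",     "used_last_semester": True},
    {"id": 4, "topic": "er",  "complexity": "hard",     "used_last_semester": False},
]
print([q["id"] for q in filter_bank(bank)])  # [1, 4]
```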
An automated paper generation system by Bhirangi & Bhoir (2016) focused on controlled access, question randomization, and user roles; the use of the cognitive level is not clearly outlined. The software was developed in the Java programming language with a MySQL database for storage, and the algorithm improves on the randomization of questions.
Artificial intelligence, randomization, and backtracking are the algorithms used by Cen et al. (2010) in their project to automate the exam generation process. The technologies used are the MVC pattern with JSP views, JavaBean models, a Servlet controller, MySQL, CSS + DIV for layout, and JavaScript. Cognitive level and question weight are not addressed. JSP and Java Servlets are being replaced by emerging technologies, and the system produces a Word document that can be edited but sometimes loses its layout due to compatibility issues.
The cognitive level is used by Joshi et al. (n.d.) in their e-system. Two algorithms, random selection and backtracking, are used; the use of artificial intelligence is not clear. The weight of each question is computed as a percentage.
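The randomization-plus-backtracking idea recurring in these systems can be sketched as follows: shuffle the candidate questions, then backtrack to find a subset whose marks sum exactly to the required total. This is a deliberate simplification of the cited systems, which also balance topics and cognitive levels.

```python
import random

def generate_exam(questions, total_marks, rng=random.Random(0)):
    """Pick a subset of questions whose marks sum to total_marks.
    Randomization: shuffling makes different runs explore different orders.
    Backtracking: include/skip each question, undoing dead-end choices."""
    pool = list(questions)
    rng.shuffle(pool)

    def backtrack(i, remaining, chosen):
        if remaining == 0:
            return chosen              # exact total reached
        if i == len(pool) or remaining < 0:
            return None                # dead end, forces backtracking
        with_q = backtrack(i + 1, remaining - pool[i]["marks"],
                           chosen + [pool[i]])
        return with_q or backtrack(i + 1, remaining, chosen)

    return backtrack(0, total_marks, [])

pool = [{"id": i, "marks": m} for i, m in enumerate([5, 10, 15, 20, 10])]
exam = generate_exam(pool, 30)
print(sum(q["marks"] for q in exam))  # 30
```

Extending the `remaining` state to a per-level or per-topic budget would let the same search also enforce a target cognitive-level distribution.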
Joshi et al. (n.d.) recommend that exam generation adopt Natural Language Processing (NLP), focusing on understanding questions' cognitive levels and preventing any question from being used too frequently.
The exams package developed by Grün & Zeileis (2009) provides software infrastructure for scalable exams, associated self-study materials, and joint development. The software adopted maintenance, variation, and correction as design principles, and the technologies used are LaTeX and R. Questions were separated into answer and solution sections, some meta-information is collected, and each question and solution description is encapsulated in LaTeX. In this approach, every exercise is contained in a separate Sweave file, so a separate file is needed for each exercise. The method was used to build a custom application for processing statistical exams.
Natural Language Processing is used to process text, and a Named Entity Recognizer and Semantic Role Labeler are used to identify semantic relations (Rakangor & Ghodasara, 2015). The main focus was to generate simple questions that are true/false or require a one-word answer.
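A toy version of true/false question generation can convey the idea. Real implementations rely on NER and semantic role labeling; in this sketch a hand-supplied (subject, relation, object) triple stands in for the extracted semantic relation, and all names are illustrative assumptions.

```python
import random

def true_false_question(triple, distractors, rng=random.Random(1)):
    """Turn a (subject, relation, object) triple into a true/false item:
    either state the triple as-is (answer True) or swap in a distractor
    object (answer False)."""
    subject, relation, obj = triple
    if rng.random() < 0.5:
        return f"True or False: {subject} {relation} {obj}.", True
    wrong = rng.choice([d for d in distractors if d != obj])
    return f"True or False: {subject} {relation} {wrong}.", False

q, answer = true_false_question(
    ("SQL", "is", "a query language"),
    ["a sorting algorithm", "a query language"],
)
print(q, answer)
```

The quality of such items depends entirely on how well the upstream NLP pipeline extracts the relation and how plausible the distractors are.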
An online system by Hameed & Abdullatif (2017) utilized web-based technologies, namely PHP and a MySQL database. Three types of questions were considered: true/false, multiple choice, and image matching. The system did not factor in artificial intelligence, cognitive level, or question weight.
A rule-based classification approach was used to classify exams by Kumara et al. (2019) and Kumara, Brahmana & Paik (2019). The model established enabled quantitative adjustment of the paper. Though the model classified the questions by cognitive level, the research concluded by recommending the introduction of machine learning techniques to improve performance.

Conclusion
Examination plays a key role in evaluating what the student has learned and needs to be performed with high precision. Examiners should come up with questions that are sensitive to the cognitive levels outlined in Bloom's taxonomy, so that these levels form part of the consideration during exam generation. The process of question classification can be automated by utilizing advances in technology, specifically the AI techniques of ML and NLP. A combination of these technologies is valuable for predicting a question's cognitive level and realizing a standard examination.