
Automating Online Proctoring Using AI

source link: https://towardsdatascience.com/automating-online-proctoring-using-ai-e429086743c8?gi=a936113e7643

Semi-automated proctoring based on vision and audio capabilities to prevent cheating in online exams and monitor multiple students at a time.


Photo by Everyday basics on Unsplash

With the advent of COVID-19, remote learning has blossomed. Schools and universities may have shut down, but they switched to applications like Microsoft Teams to finish their academic years. However, there has been no good solution for examinations. Some institutions have converted exams into assignments, which students can simply copy and paste from the internet, while others have cancelled them outright. If the way we are living is to be the new norm, there needs to be a solution.

ETS, which conducts the TOEFL and GRE among other exams, is allowing students to take their tests from home, monitored by a proctor for the whole duration of the exam. Implementing this scheme at a large scale is not plausible because of the workforce required. So let's create an AI in Python that can monitor students using the webcam and laptop microphone and enable teachers to monitor multiple students at once. The entire code can be found on my GitHub repo.

The AI will have four vision-based capabilities which are combined using multithreading so that they can work together:

  1. Gaze tracking
  2. Mouth open or close
  3. Person Counting
  4. Mobile phone detection

Apart from this, the speech from the microphone will be recorded, converted to text, and will also be compared to the text of the question paper to report the number of common words spoken by the test-taker.

Requirements

  • OpenCV
  • Dlib
  • TensorFlow
  • Speech_recognition
  • PyAudio
  • NLTK

Vision-Based Techniques

Gaze Tracking


Photo by S N Pattenden on Unsplash

We shall aim to track the eyeballs of the test-taker and report if they are looking to the left, right, or up, which they might do to glance at a notebook or signal to someone. This can be done using Dlib's facial keypoint detector and OpenCV for further image processing. I have already written an article on real-time eye tracking which explains in detail the methods that will be used later on.
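To make the approach concrete, here is a minimal sketch (not the code from that article) that pulls the eye landmarks from dlib's 68-point model, thresholds the eye region, and uses the horizontal position of the pupil centroid to flag sideways glances. The pupil_ratio helper, the binarization threshold of 50, and the 0.35/0.65 cut-offs are illustrative assumptions.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def pupil_ratio(gray, pts):
    # Horizontal position of the pupil centroid within the eye region, in [0, 1].
    x, y, w, h = cv2.boundingRect(pts)
    eye = gray[y:y + h, x:x + w]
    eye = cv2.GaussianBlur(eye, (5, 5), 0)
    _, thresh = cv2.threshold(eye, 50, 255, cv2.THRESH_BINARY_INV)  # dark pupil -> white
    m = cv2.moments(thresh)
    if m["m00"] == 0:
        return 0.5  # pupil not found; assume centre
    return (m["m10"] / m["m00"]) / w

cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for face in detector(gray):
        shape = predictor(gray, face)
        # Landmarks 36-41 are one eye, 42-47 the other.
        left = np.array([(shape.part(i).x, shape.part(i).y) for i in range(36, 42)], dtype=np.int32)
        right = np.array([(shape.part(i).x, shape.part(i).y) for i in range(42, 48)], dtype=np.int32)
        ratio = (pupil_ratio(gray, left) + pupil_ratio(gray, right)) / 2
        if ratio < 0.35:
            print("Looking to one side (left in image coordinates)")
        elif ratio > 0.65:
            print("Looking to one side (right in image coordinates)")
    cv2.imshow("feed", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```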

Mouth Detection


Mouth Tracking Results

This is very similar to eye detection. Dlib's facial keypoints are again used for this task: the test-taker is required to sit straight (as they would in the test) and the distances between the lip keypoints (5 outer pairs and 3 inner pairs) are recorded for 100 frames and averaged.

If the user opens their mouth, the distances between the points increase; if the increase is more than a certain value for at least three outer pairs and two inner pairs, an infringement is reported.
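Here is a minimal sketch of this check, assuming dlib's 68-point landmark scheme; the specific upper/lower lip pairing, the 10-pixel margin, and the window names are my own illustrative choices, not necessarily the constants used in the repo.

```python
import cv2
import dlib
import numpy as np

# Upper/lower lip landmark pairs in dlib's 68-point scheme (assumed pairing).
OUTER_PAIRS = [(49, 59), (50, 58), (51, 57), (52, 56), (53, 55)]
INNER_PAIRS = [(61, 67), (62, 66), (63, 65)]
MARGIN = 10          # extra pixels of separation before a pair counts as "open" (illustrative)
CALIB_FRAMES = 100   # closed-mouth frames used to build the baseline

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def pair_distances(shape, pairs):
    # Vertical separation of each upper/lower lip landmark pair.
    return np.array([abs(shape.part(a).y - shape.part(b).y) for a, b in pairs])

calib_outer, calib_inner = [], []
cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if faces:
        shape = predictor(gray, faces[0])
        outer = pair_distances(shape, OUTER_PAIRS)
        inner = pair_distances(shape, INNER_PAIRS)
        if len(calib_outer) < CALIB_FRAMES:
            # Calibration phase: average distances while the mouth is closed.
            calib_outer.append(outer)
            calib_inner.append(inner)
        else:
            base_outer = np.mean(calib_outer, axis=0)
            base_inner = np.mean(calib_inner, axis=0)
            if (np.sum(outer > base_outer + MARGIN) >= 3
                    and np.sum(inner > base_inner + MARGIN) >= 2):
                print("Mouth open - possible talking")
    cv2.imshow("feed", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```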

Person Counting and Mobile Phone Detection

I used the pre-trained weights of YOLOv3, trained on the COCO dataset, to detect people and mobile phones in the webcam feed. For an in-depth explanation of how to use YOLOv3 in TensorFlow 2 and how to perform people counting, you can refer to this article:

If the person count is not equal to one, an alarm can be raised. The index of the mobile phone class in the COCO dataset is 67, so if any detected class index equals 67, a mobile phone can be reported as well.
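The check on top of the detections is simple. In the sketch below, detect_classes is a hypothetical wrapper around the YOLOv3 forward pass from the linked article, assumed to return the COCO class index of every detection above the confidence threshold.

```python
# COCO class indices: 0 = person, 67 = cell phone.
PERSON_CLASS, PHONE_CLASS = 0, 67

def check_frame(frame, detect_classes):
    """detect_classes is a hypothetical YOLOv3 wrapper returning COCO class
    indices for every detection above the confidence threshold."""
    classes = detect_classes(frame)
    person_count = sum(1 for c in classes if c == PERSON_CLASS)
    if person_count != 1:
        print(f"Alert: {person_count} person(s) in frame")
    if any(c == PHONE_CLASS for c in classes):
        print("Alert: mobile phone detected")
```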

Combining using Multithreading

Let's dive into the code now. As eye tracking and mouth detection are based on dlib, we can create a single thread for them, and another thread can be used for the YOLOv3 tasks: people counting and mobile phone detection.

First, we import all the necessary libraries along with the helper functions. Then the dlib and YOLO models are loaded. In the eyes_mouth() function, we find the facial keypoints and work on them. For mouth detection, the original distances between the outer and inner lip points are already defined, and we calculate the current ones. If a certain number of them exceed the predefined distances, the proctor is notified. For the eyes, we find their centroids as shown in the linked article and then check which facial keypoints they are closest to. If both centroids lie towards a side, it is reported accordingly.

In the count_people_and_phone() function, YOLOv3 is applied to the webcam feed. Then the classes of objects detected are checked and appropriate action is taken if more than one person is detected or a mobile phone is detected.

These functions are run in separate threads and contain infinite loops, which the proctor can break by pressing 'q' twice.
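In outline, the combination can look like the sketch below, where eyes_mouth() and count_people_and_phone() stand for the two loop functions described above (their bodies are stubbed out here).

```python
import threading

def eyes_mouth():
    # dlib-based loop: gaze tracking + mouth-open detection (see sketches above).
    ...

def count_people_and_phone():
    # YOLOv3-based loop: person counting + mobile phone detection.
    ...

# Run both vision loops concurrently so they monitor the same session together.
t1 = threading.Thread(target=eyes_mouth)
t2 = threading.Thread(target=count_people_and_phone)
t1.start()
t2.start()
t1.join()
t2.join()
```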

Audio


Photo by Ethan McArthur on Unsplash

The idea is to record audio from the microphone and convert it to text using Google's speech recognition API. The API needs a continuous voice input from the microphone, which is not plausible, so the audio is recorded in chunks. Recording this way avoids any compulsory storage requirement (a ten-second WAV file had a size of about 1.5 MB, so keeping everything for a three-hour exam would take roughly 1.6 GB). A separate thread is used to call the API so that recording can continue without interruption; the API processes the last chunk stored, appends its text to a text file, and then deletes the chunk to save space.
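A minimal sketch of this pipeline, using the speech_recognition package for both the chunked recording and the Google API calls. The 10-second chunk length and the six-chunk loop are illustrative (the real code keeps recording for the whole exam); test.txt is the transcript file used later.

```python
import threading
import queue
import speech_recognition as sr

recognizer = sr.Recognizer()
chunks = queue.Queue()

def transcribe_worker():
    # Convert each recorded chunk with Google's API and append the text to test.txt.
    while True:
        audio = chunks.get()
        if audio is None:          # sentinel: recording has finished
            break
        try:
            text = recognizer.recognize_google(audio)
            with open("test.txt", "a") as f:
                f.write(text + " ")
        except (sr.UnknownValueError, sr.RequestError):
            pass                   # silence or API failure: skip this chunk

worker = threading.Thread(target=transcribe_worker, daemon=True)
worker.start()

with sr.Microphone() as source:
    for _ in range(6):             # e.g. one minute of audio in 10-second chunks
        chunks.put(recognizer.record(source, duration=10))

chunks.put(None)
worker.join()
```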

After that, using NLTK, we remove the stopwords from it. The question paper (in text format) is taken, its stopwords are also removed, and the two sets of contents are compared. We assume that if someone wants to cheat, they will speak something from the question paper. Finally, the common words, along with their frequencies, are presented to the proctor. The proctor can also look at the text file, which contains all the words spoken by the candidate during the exam.

Up to line 85 in the code, we are continuously recording, converting, and storing text data in a file. The function read_audio(), as its name suggests, records audio using a stream passed to it by stream_audio(). The function convert() uses the API to convert it to text and appends it to the file test.txt along with a blank space. This part runs for the entire duration of the examination.

After this, using NLTK, we convert the stored text to tokens and remove the stop words. The same is done for a text file of the question paper, and then the common words are found and reported to the proctor.
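A sketch of this comparison step with NLTK; paper.txt is a hypothetical file name for the question paper text, while test.txt is the transcript file mentioned above.

```python
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

def content_words(path):
    # Tokenize a text file and keep only alphabetic, non-stop-word tokens.
    with open(path) as f:
        tokens = word_tokenize(f.read().lower())
    return [t for t in tokens if t.isalpha() and t not in stop_words]

spoken = content_words("test.txt")        # transcript built during the exam
paper = set(content_words("paper.txt"))   # question paper text (hypothetical file name)
common = Counter(w for w in spoken if w in paper)
for word, freq in common.most_common():
    print(word, freq)
```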

