Today we will be taking a look at a study by Anh et al. (2019) that proposes a computer-vision-based application for monitoring student behaviour in classrooms.
Many factors affect a student's performance, such as the academic infrastructure, finances, teachers, learning environments and the learner's behaviour. The latter includes factors like motivation, attitude and skills. For a student to succeed and thrive, a teacher must recognize and understand the attitude that each student brings to the classroom. Identifying disengaged or disruptive behaviour in the classroom is therefore important. This is quite straightforward in a small classroom, but what about a classroom of 20+ people? Recognizing the behaviour of every individual student is a huge challenge for a teacher.
Studies have shown that students who pay more attention in class have a higher probability of achieving better results; in other words, there is a positive correlation between student achievement and degree of attentiveness (Shannon, 1942). Nowadays, digital technology can be a severe learning distraction, such as texting during class. Students engaging in these sorts of activities are likely to be distracted.
Monitoring these sorts of behaviours can enable teachers to recognize a student’s disengagement. Computer vision can be a fantastic tool to achieve this.
Computer vision can help monitor these activities
Computer vision is one of the most important technologies that enables the digital world to interact with the physical world. It enables computers to understand the content of images and videos (real world scenes). Computer vision techniques can identify, track, measure and classify specific objects in the video / image. So how does this relate to monitoring students?
Source: https://xrlabs.co/our-capabilities/cv/
Understanding the educational needs of students is important. The analysis of student data collected in this field can guide teachers when designing or redesigning courses, implementing new assessments and structuring communication channels with students. Computer vision can make this process efficient, accurate and successful.
A system analysing this data must monitor attention in the classroom, with a main focus on quantifying body motion and estimating the eye-gaze direction (Anh et al., 2019). Imagine a system where several cameras are placed around the room, collecting body-motion and facial-recognition data. The system can essentially be trained to recognize a student's face and subsequently attribute the observed motions to that student.
Monitoring the eye gaze of students can indicate whether a student is mainly looking at the teacher, the notebook or in other directions (e.g. the door, the clock, the student's high school crush). Attentive students tend to share a common behaviour pattern, engaging in similar types of motions. The researchers therefore propose a system based on head movements and eye-gaze direction, whereby the synchronization between a student's head orientation and the movement of the teacher is used as an indicator of the student's attentiveness.
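The synchronization idea can be made concrete with a toy example: correlate a student's head yaw over time with the teacher's horizontal position as they walk around the room. A high positive correlation suggests the student is visually tracking the teacher. The Pearson correlation formula below is standard; the data and the attentiveness threshold are invented for illustration and are not from the paper.

```python
# Toy sketch: does a student's head yaw track the teacher's position?
# A Pearson correlation near 1.0 means the head turns in step with
# the teacher's movement across the room.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented sample data: the teacher walks left to right and back,
# and the student's head yaw (degrees) follows the same pattern.
teacher_x   = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0]
student_yaw = [-10, -5, 0, 5, 0, -5]

print(pearson(teacher_x, student_yaw))  # close to 1.0 -> tracking the teacher
```

A real system would of course compute this over sliding windows of video rather than six hand-picked samples, but the underlying signal is the same.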
Let’s take a look at what this system could look like.
The proposed system
The researchers proposed a system based on 7 components: a recorder (responsible for recording), a recorder controller (deciding which recorder records from which camera), a task repository (storing the videos and metadata), a task assignment manager (automatically retrieving schedules and assigning tasks to workers), workers (processing assigned tasks and writing the results to the report database), a report database, and a web server (visualizing the data). Within the worker module sits the AI module, considered the soul of the system. It is composed of 4 stages: data retrieving, frame processing, summarizing data, and output to the database.
During the data retrieving stage, the AI module retrieves all of the information (video recording, student lists, etc.) provided by the task assignment manager. The data retrieved from the video and metadata is passed to the frame processing stage, which processes every video frame and outputs facial and eye-gaze data. This data is then condensed in the summarizing component; within its face classification step, the student list from the metadata and the facial data are combined for the final classification.
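The four stages of the AI module can be sketched as a small pipeline. Everything below is hypothetical scaffolding (the paper does not publish code): the class names, the dictionary layout of a task, and the face-detector output format are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical sketch of the AI module's four stages:
# retrieve -> process frames -> summarize -> output to database.

@dataclass
class FrameResult:
    student_id: str   # face matched against the class list
    head_pose: tuple  # (yaw, pitch, roll) in degrees (assumed format)
    gaze: tuple       # gaze-direction vector (assumed format)

class AIModule:
    def retrieve(self, task):
        """Stage 1: fetch the recording and metadata given by the task manager."""
        return task["video_frames"], task["student_list"]

    def process_frames(self, frames, student_list):
        """Stage 2: per frame, keep face / head-pose / gaze data for known students."""
        results = []
        for frame in frames:
            for face in frame["faces"]:  # detector output, format assumed
                if face["name"] in student_list:
                    results.append(FrameResult(face["name"],
                                               face["head_pose"],
                                               face["gaze"]))
        return results

    def summarize(self, results):
        """Stage 3: aggregate per-student observations across all frames."""
        summary = {}
        for r in results:
            summary.setdefault(r.student_id, []).append(r.gaze)
        return {sid: len(gazes) for sid, gazes in summary.items()}

    def run(self, task, db):
        """All four stages end to end; stage 4 writes results to the report DB."""
        frames, students = self.retrieve(task)
        results = self.process_frames(frames, students)
        db.update(self.summarize(results))  # stage 4: output to database
```

The real system would aggregate far richer statistics than a frame count per student, but the stage boundaries match the paper's description.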
Position estimation is also an important component. The relative positions of the students in the classroom matter. Two students at the same table may both be looking to the left, but one student may be looking at the board whilst the other student is looking at something else. The frame acquired from the camera can be transformed into a 3D coordinate system, based on a range of functions.
Picture A shows the frame acquired from the camera, while B shows the 3D coordinate system of the student sample. The left image shows the position of the students from the perspective of the board, whilst the right one shows a top-down perspective. This type of technology can be used to estimate the positions of students.
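One common way to perform this kind of image-to-room mapping (an assumption here, not necessarily the paper's exact calibration method) is a planar homography: a 3x3 matrix that maps pixel coordinates to floor coordinates. The matrix below is made up for demonstration; in practice it would be estimated from known reference points in the classroom.

```python
# Toy illustration of mapping a camera pixel to a classroom floor
# position with a planar homography. H is invented for demonstration.

def apply_homography(H, x, y):
    """Map image pixel (x, y) to floor coordinates via a 3x3 homography H."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w  # perspective divide

# Illustrative matrix: scales pixels to metres with no perspective skew.
# A real H, fitted to reference points, would have nonzero off-diagonal terms.
H = [[0.01, 0.0, 0.0],
     [0.0, 0.02, 0.0],
     [0.0, 0.0, 1.0]]

floor_x, floor_y = apply_homography(H, 640, 360)
print(floor_x, floor_y)  # the pixel's estimated position on the floor plan
```

With every student's seat mapped into the same floor-plan coordinates, the system can reason about where each gaze direction actually points, which is exactly the ambiguity the next step resolves.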
The student ID, the student's position and the gaze can then be combined to monitor the students and detect their motions. First, the student must be identified and located, so that the tracked behaviours can later be attributed to them. Next, the row and column of the "position matrix", representing the student's current position in the class, are evaluated; combined with the head-pose direction and the gaze vector, this denotes the origin of the gaze (think of the earlier example with two students looking towards the left). Lastly, the gaze itself plays a vital role in the system, revealing whether students are looking at the slides, their notebook, their phone or anything else really! The gaze data enables educators to observe the behaviour of students.
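This combination step can be illustrated with a minimal sketch: given a seat's position relative to the board and the student's head yaw, decide what they are looking at. The geometry and the 15-degree threshold are invented for illustration; the paper's actual decision rules are not published.

```python
import math

# Hedged sketch: seat position + head yaw -> gaze target.
# Two students at the same table can both turn their heads "left",
# yet only one of them is actually facing the board.

def gaze_target(seat_x, yaw_deg, seat_y=5.0):
    """Classify the gaze target from a seat's floor position and head yaw.

    seat_x:  metres right of the board's centre line (negative = left)
    seat_y:  metres from the board
    yaw_deg: head yaw in degrees, 0 = facing straight ahead, positive = turned left
    """
    # Angle from this seat to the centre of the board: this is the
    # "origin" that resolves the two-students-looking-left ambiguity.
    angle_to_board = math.degrees(math.atan2(seat_x, seat_y))
    if abs(yaw_deg - angle_to_board) < 15:  # invented tolerance
        return "board"
    return "elsewhere"

# Two students at the same table, both with heads turned ~20 degrees left:
print(gaze_target(seat_x=2.0, yaw_deg=20))   # seat right of centre: facing the board
print(gaze_target(seat_x=-2.0, yaw_deg=20))  # seat left of centre: looking away
```

The same identical head movement classifies differently once the seat position is taken into account, which is precisely why the position matrix matters.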
The researchers used the latest deep learning models, 3D coordinate methods and gaze estimation technology to improve the performance of their model. They applied the SSH face detector in the frame processing stage, together with Hopenet, a head pose estimation network (GitHub link: https://github.com/natanielruiz/deep-head-pose).
We aren’t too far away from potentially implementing a system like this in schools. The costs associated with this model might still be a bit too high for the average school, but as with any technology, iterative development could eventually make it affordable. In my opinion, it’s only a matter of time until computer vision is increasingly used to monitor student behaviour. Such a system could also help detect cheating during examinations.
What do you think? Will we ever see this type of technology in a classroom?
About IntoEdtech
IntoEdtech covers the latest and most promising technological trends in education. We summarise big topics in short blog posts, covering a range of technology applications in the education sector.
Subscribe to the newsletter (it’s free!) if you would like to receive more posts like this one directly in your inbox!