Smart sensors, face recognition at a distance and the smart city: an outline of the relevance of the problem
Facial recognition applications have proven useful in a variety of contexts, typically associated with the processes of monitoring and surveillance (global and reactive) of private and public areas. Monitoring tends to be interpreted as a situational activity, consistent with watching and checking a situation carefully for a period of time in order to discover something about it. Surveillance is a more contentious concept21,22,23, even if it is commonsensically treated as identical to monitoring. Surveillance is most closely associated with the continuous and systematic collection, analysis, and interpretation of data needed to design, implement, and evaluate strategies, policies, action plans, and so on. In the context of safety and security, a well-designed surveillance system would integrate sophisticated AI-based tools and applications, including facial recognition-based tools, with technologies and techniques that enable data mining, data processing, and data analysis. Eventually, these should feed the management and decision-making process24. In addition, successive point-of-interest (POI) recommendation has become a hot research topic in the augmented Intelligence of Things (AIoT)25,26,27, and it may become highly relevant to security applications in the near future.
In this context, smart cities represent one of the most interesting test-beds for monitoring and surveillance systems7. This is because, on the one hand, the integration of the built environment and the ICT infrastructure creates the basic conditions for installing and operating the devices necessary for monitoring and surveillance. On the other hand, the density and velocity of social interaction in the urban space, including traffic, congestion, mobility, trade, etc., generate demand for some form of monitoring and surveillance. Monitoring and surveillance systems in use in the (smart) city space fulfill several objectives, including traffic light control, traffic and congestion management (for instance during rush hours or emergencies), optimization of public transportation use, and preempting risks to public safety (e.g. averting mob formation or mitigating the spread of diseases such as Covid-19). Increasingly, facial recognition-based tools are becoming a part of these systems, although they also raise serious questions related to the collection, storage, and use of the personal data thus gathered.
Supporters of the use of facial recognition-based tools in the (smart) city space tend to argue that such solutions are convenient, while at the same time allowing the detection of suspicious (or irregular) behavior, the identification of perpetrators sought by the authorities, and so on. As always, the questions are who has the right to collect the data and on what grounds, who has the right to use the data and for which purposes, how safe storage of the data is ensured, and so on. Researchers in this field are aware of the inherent risks of the facial recognition schemes applied and used in China28,29, and not only because of the bias inherent in the AI/ML underlying the facial recognition and interpretation process. Certainly, as technology progresses and increasingly sophisticated tools emerge that can be applied to surveillance and monitoring, several contentious issues will arise30,31.
However, the legal and ethical challenges hinted at in this section should not impede the research and discovery process. Consider the case of dual-use technologies32,33: smart monitoring and surveillance systems can detect dangerous situations caused by negligence, e.g. car accidents, or even preempt risks by identifying serious speed limit violations in the city space and raising alerts. In other words, in the quest for ever more efficient and accurate ways of navigating the problems inherent in existing tools and applications, as in the case of the at-a-distance facial recognition capability discussed here, it is imperative that ethical and legal considerations be taken into account. Only in this way can we design and implement technologies and tools that will serve society at large.
Related work
The efficiency of face recognition algorithms is closely related to the quality of the input images. Recognition is therefore especially challenging when working with low-quality images, which is often the case in video surveillance applications. With regard to face recognition using security cameras, previous works have mainly focused on topics such as low resolution, facial recognition with infrared security cameras, and recognition at a distance.
Low-resolution face recognition is still a challenging problem. Since biometric systems have generally been trained with high-quality images, low-resolution input images are likely to result in errors in the recognition process. In recent years, there has been a significant number of works on low-resolution face recognition that use deep learning techniques with Convolutional Neural Networks (CNN). Most of these works are based on image super-resolution34 and, generally speaking, there are two alternatives for the recognition process: (i) input low-resolution images are compared with high-resolution images in the dataset; (ii) both input and gallery images are low-resolution face images. For instance, Cheng et al.35 presented a new super-resolution deep learning method that introduces joint learning of super-resolution and face recognition. In36, a two-stream CNN is first initialized to recognize high-resolution and low-resolution faces simultaneously for video streaming applications. Moreover, Yu et al.37 introduced two super-resolution networks for single-image super-resolution with higher accuracy than several previous works. In38, a gradient image super-resolution solution based on a residual learning deep neural network is developed. However, these deep learning face recognition models generally show a substantial degradation in performance when evaluated with native low-resolution face images and have other limitations, such as the difficulty of extracting discriminative features from low-resolution faces and the trade-off between accuracy and computational cost.
Regarding facial recognition in video surveillance, recent works include39, where video surveillance applications are studied in depth, considering that low resolution can greatly affect the reliable recognition of individuals. An automatic pose-invariant approach to Face Recognition At a Distance (FRAD) is presented in40, where a 3D reconstruction of faces is performed and Local Binary Patterns (LBP) are used for classification. As in many other areas, deep learning is also used for facial recognition in video surveillance (see41,42,43). These works are limited by their reliance on stereo image pairs, their limited consideration of other sources of variation, and the absence of evaluation on large-scale datasets.
On the other hand, Wheeler20 pointed out the lack of datasets focused on video surveillance for training and testing models, noting that it is common to use downsampling to simulate images captured at a distance. To address this, Grgic et al.17 presented a face image database, the SCFace dataset, taken in an uncontrolled indoor environment using five commercial video surveillance cameras. The database contains 130 different people and a total of 4,160 images of varying quality taken at different distances. Although some other datasets have been presented in the last few years, they can no longer be accessed due to recent changes in data privacy laws.
From these works it becomes clear that face recognition at a distance poses several significant limitations. First, the resolution and quality of the captured images decrease with distance, reducing the amount of critical facial information available for analysis. This loss of resolution can be exacerbated in real-world scenarios, where factors like poor lighting conditions, occlusions, and varying camera angles further hinder the process.
Secondly, the effectiveness of face recognition at a distance is highly dependent on the available hardware and imaging capabilities. Long-range surveillance cameras may struggle to capture clear and detailed facial images, which can impact the accuracy and reliability of recognition systems. Additionally, the computational demands of processing distant faces can be considerable, requiring specialized and powerful hardware that may not always be feasible or cost-effective for widespread deployment. To overcome these limitations, ongoing research and development are necessary to improve the robustness, accuracy, and ethical soundness of face recognition technology at a distance. Proper data collection, algorithmic improvements, and the establishment of ethical guidelines are crucial steps in effectively addressing these challenges.
In conclusion, facial recognition at a distance for security applications has mainly dealt with near-field images captured by sensors that do not use leading-edge technology. State-of-the-art surveillance cameras have very good resolution at close distances, but as the target person moves farther away, errors due to low-resolution face recognition may emerge. Consequently, our work aims to study the behavior of current image sensors when capturing images at long distances and to test how well our face recognition algorithm works with low-resolution images. In doing so, the recognition accuracy of each image sensor is also compared as a function of the distance between the individual and the lens.
Implementation
As mentioned in the previous sections, the main goal of this work is to extract faces from images taken by surveillance cameras, use a deep learning model to recognize the individuals in those images, and finally evaluate the behavior of our system with a set of commercial image sensors at several distances. Figure 1 shows graphically how our system works.
Implementation of a facial recognition system at a distance
A combination of Machine Learning and Deep Learning techniques, using Python44 as the programming language, has been employed to develop our method. Transfer learning45 techniques have also been used to optimize results and save training time. The overall architecture of our approach is shown in Fig. 2. The different steps followed in our model are explained below.
Face detection
For face detection, our model is based on the Single Shot Detector (SSD) framework with a reduced ResNet-10 model (for an extensive review of this model, see46,47), using the pre-trained res10 SSD model48. This model was created with the Caffe deep learning library49 and gives robust results when identifying people's faces at different distances and in different positions. In our case, it provides the position of the faces in the training and test datasets so that facial recognition can be performed afterwards.
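Since the res10 SSD model is publicly distributed in Caffe format and commonly loaded through OpenCV's DNN module, the detection step can be sketched as follows. The model file names and the 0.5 confidence threshold are the usual defaults for this public model, not values taken from this work:

```python
import cv2
import numpy as np

# Pre-trained res10 SSD face detector (Caffe format); the file names below
# are those commonly distributed with OpenCV and are an assumption here.
net = cv2.dnn.readNetFromCaffe("deploy.prototxt",
                               "res10_300x300_ssd_iter_140000.caffemodel")

def detect_faces(image, conf_threshold=0.5):
    """Return face bounding boxes (x1, y1, x2, y2) found in a BGR image."""
    h, w = image.shape[:2]
    # 300x300 input and BGR mean subtraction are the standard
    # preprocessing for this model.
    blob = cv2.dnn.blobFromImage(cv2.resize(image, (300, 300)), 1.0,
                                 (300, 300), (104.0, 177.0, 123.0))
    net.setInput(blob)
    detections = net.forward()  # shape: (1, 1, num_detections, 7)
    boxes = []
    for i in range(detections.shape[2]):
        confidence = detections[0, 0, i, 2]
        if confidence >= conf_threshold:
            # Coordinates are returned normalized to [0, 1].
            box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
            boxes.append(box.astype(int))
    return boxes
```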
Facial recognition
For facial recognition, once the position of a face has been detected, a feature extractor based on the VGG_Face model50 is used, which was trained with the ImageNet database and achieves a 96.7% success rate in near-field face recognition. In general, let us consider that N faces are detected in any input image and that the training set contains M faces.
VGG_Face is one of the most popular and widely used facial recognition models. In this network, a feature vector of 2048 elements is taken from the last fully connected layer for each face image. Afterwards, the cosine distances51 between the feature vectors of the faces detected in an input image, \(\{A_1, \ldots , A_N\}\), and those of the training set of images, \(\{B_1, \ldots , B_M\}\), are calculated. The cosine similarity between two vectors \(\vec {A}\) and \(\vec {B}\), \(\cos (\vec {A},\vec {B})\), is defined as:
$$\begin{aligned} \cos (\vec {A},\vec {B}) = \frac{\vec {A}\cdot \vec {B}}{\Vert \vec {A} \Vert \Vert \vec {B} \Vert } \end{aligned}$$
(1)
The cosine distance \(d_C\) between two vectors \(\vec {A}\) and \(\vec {B}\) is:
$$\begin{aligned} d_C= 1-\cos (\vec {A},\vec {B}) \end{aligned}$$
(2)
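For concreteness, Equations (1) and (2) translate directly into a few lines of NumPy; the vectors below stand in for the 2048-element feature vectors described above:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Eq. (1): cos(A, B) = (A . B) / (||A|| ||B||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Eq. (2): d_C = 1 - cos(A, B)."""
    return 1.0 - cosine_similarity(a, b)
```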
Then, a one-shot learning technique52 is applied. This is a classification task in which a single example (or a very small number of examples) is given for each class (an individual, in our case) and used to train the model, which then makes predictions about many unknown examples from the testing set.
The pseudo code for our proposed algorithm is shown in Algorithm 1.
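Algorithm 1 itself is not reproduced here; the following is a minimal Python sketch of the one-shot matching step as described above. The gallery layout and the rejection threshold tau are illustrative assumptions rather than parameters reported in this work:

```python
import numpy as np

def match_faces(probe_vecs, gallery_vecs, gallery_ids, tau=0.4):
    """One-shot matching sketch: assign each detected face the identity of
    the gallery vector with the smallest cosine distance (Eq. 2), or None
    if that distance exceeds the threshold tau (tau is an assumed value).

    probe_vecs   : (N, D) feature vectors of faces detected in the input image
    gallery_vecs : (M, D) one feature vector per enrolled individual
    gallery_ids  : list of M identity labels
    """
    # Normalize rows so a dot product equals the cosine similarity.
    p = probe_vecs / np.linalg.norm(probe_vecs, axis=1, keepdims=True)
    g = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)
    dist = 1.0 - p @ g.T          # (N, M) cosine distances
    best = dist.argmin(axis=1)    # closest gallery entry per probe face
    return [gallery_ids[j] if dist[i, j] <= tau else None
            for i, j in enumerate(best)]
```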
Generation of a dataset for FRAD
The lack of databases with images of faces taken at different distances has been reported in the literature20. Moreover, in the last few years some of the most popular datasets for face recognition at a distance have become inaccessible due to data protection laws. For this reason, and to test the detection range of our face recognition system with different sensors, an extended dataset has been generated from a regular dataset53,54. The new dataset contains high-resolution images of different individuals.
The generation of the dataset had to be realistic, since a large image sensor does not collect the same visual information about a person at a particular distance as a smaller sensor at the same distance. Thus, the face of an individual in an image taken by a surveillance camera in an urban area, where several other individuals may appear, will be very small when the target person is far away. As that person gets closer to the camera, his/her face will cover ever larger areas of the image. Moreover, if the camera sensor has higher resolution, that is, if it captures more pixels per image, a person's face at a certain distance will occupy more pixels than the same face at the same distance captured by a camera with a smaller sensor.
Consequently, an estimate of the size in pixels that a human face occupies at a given distance using a particular optical sensor is extremely useful. In our case, once this value is calculated, an extended dataset is generated from the original high-resolution dataset by applying an antialiasing filter to downsample each original image55,56,57. As a result, a new dataset at different distances and for different image sensors can be obtained.
To calculate the size in pixels of a human face in an image when the individual is at a given distance from the camera, let us consider the relationship between the focal length DF of the optical system58, the distance d from the object to the sensor, and the size of the sensor. In addition, the form factor of the captured image is considered in order to maintain its aspect ratio. As a result, the distance can be calculated as:
$$\begin{aligned} d=\frac{D F \cdot H_R \cdot H_V}{H_P \cdot H_S} \end{aligned}$$
(3)
In (3), DF is the focal length in mm, \(H_R\) is the actual height of the object (that is, the human face), \(H_V\) is the vertical height of the image in pixels, \(H_P\) is the height in pixels that the object should have at distance d, and \(H_S\) is the physical height of the sensor in mm. See Fig. 3 for a graphical interpretation of Equation (3). An approximate value of 216 mm has been assigned to \(H_R\), since previous works have established this value as the average head height of an adult person59,60.
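Rearranging Equation (3) for \(H_P\) gives the expected face height in pixels at distance d, which then becomes the target size for the antialiased downsampling described above. The sketch below assumes Pillow's LANCZOS resampling as the antialiasing filter and uses illustrative sensor parameters:

```python
from PIL import Image

def face_height_px(DF, d, H_R=216.0, H_V=1080, H_S=5.6):
    """Eq. (3) solved for H_P: expected face height in pixels when the
    subject is at distance d (all lengths in mm, H_V in pixels).
    The default H_V and H_S values are illustrative, not from this work."""
    return DF * H_R * H_V / (d * H_S)

def downsample_face(img: Image.Image, DF, d, **sensor) -> Image.Image:
    """Shrink a high-resolution face image so its height matches the size
    it would occupy at distance d; LANCZOS resampling acts as the
    antialiasing filter."""
    h_new = max(1, round(face_height_px(DF, d, **sensor)))
    w_new = max(1, round(img.width * h_new / img.height))  # keep aspect ratio
    return img.resize((w_new, h_new), Image.LANCZOS)

# Example: a 4 mm lens with a subject at 10 m (10,000 mm) yields a face
# roughly 17 pixels tall under the default sensor parameters above.
# face = downsample_face(Image.open("face.png"), DF=4.0, d=10000.0)
```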
As a result, this approach allows us to synthetically expand the samples in a dataset in order to study which sensors perform better at given distances in an object (in this case, face) recognition scheme. Some previous studies61,62 have shown that results obtained with synthetically created low-resolution images can be slightly better than those obtained with real low-resolution images. This small improvement in recognition accuracy is considered acceptable, mainly because (i) there is no existing dataset that addresses the problem presented in this article and (ii) regardless of the sensor used, this improvement acts as a constant offset across sensors, as shown in previous works. Consequently, the results obtained in the next section are valid and can be extended to real datasets with slight changes in the accuracy results.