View-Invariant Template Matching Using Homography Constraints
Sina Lotfian, Hassan Foroosh

TL;DR
This paper introduces a view-invariant object recognition method that matches objects across different viewpoints without requiring prior 3D knowledge or camera restrictions, leveraging homography constraints.
Contribution
It presents a novel approach that uses homography properties to achieve view-invariant matching without prior 3D models or camera orientation restrictions.
Findings
Method is robust to noise.
Effective in face and object recognition tasks.
No need for prior 3D object models.
| Dataset | Accuracy (%) | No. of Classes | Dataset Size |
|---|---|---|---|
| Coil-20 | 86.1 | 20 | 1440 |
| Pointing04 | 77.7 | 15 | 2690 |
| UMIST | 92.5 | 20 | 565 |
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Advanced Vision and Imaging
Sina Lotfian and Hassan Foroosh are with the Department of Computer Science, University of Central Florida, Orlando, FL, 32816 USA.
Abstract
Change in viewpoint is one of the major factors for variation in object appearance across different images. Thus, view-invariant object recognition is a challenging and important image understanding task. In this paper, we propose a method that can match objects in images taken under different viewpoints. Unlike most methods in the literature, no restriction on camera orientations or internal camera parameters is imposed and no prior knowledge of the 3D structure of the object is required. We prove that when two cameras take pictures of the same object from two different viewing angles, the relationship between every quadruple of points reduces to the special case of a homography with two equal eigenvalues. Based on this property, we formulate the problem as an error function that indicates how likely two sets of 2D points are projections of the same set of 3D points under two different cameras. A comprehensive set of experiments was conducted to demonstrate the robustness of the method to noise and to evaluate its performance on real-world applications, such as face and object recognition.
Index Terms:
Object Recognition, View Invariance, Homography, Homology
I Introduction
Object recognition from raw images given one or more examples as template(s) is a challenging problem that has important applications in diverse areas of computer vision such as image annotation [142, 143, 144, 140, 139, 141], self-localization [80, 79, 85, 86, 83, 87], surveillance [72, 78, 136, 84, 135, 13, 138, 76, 75, 77, 71, 70, 14, 9, 133], human action and interaction recognition [129, 15, 10, 137, 126, 134, 16, 124, 127, 11, 12, 35, 125, 128, 8], target localization and tracking [131, 97, 99, 122, 98], shape modeling and pattern recognition [38, 37, 148, 94, 101, 2, 1, 54, 3, 36, 53], and image-based rendering [47, 48, 130, 28, 145, 100, 4, 5, 65, 22, 146, 44]. The problem is often exacerbated by issues such as image quality, noise, and drastic appearance changes caused by viewpoint variations. Although preprocessing steps such as image enhancement [113, 121, 116, 46, 33, 117, 93, 111, 107, 108, 112, 109, 32, 123, 34, 69, 114, 115, 120] and registration [64, 59, 29, 30, 6, 18, 19, 119, 60, 118, 26, 25, 63, 24, 61, 57, 110, 23, 21, 58, 56] may help in tackling some of the challenges, viewpoint variations remain by and large challenging.
The variation in pose and viewpoint can distort the feature space to the extent that many recognition algorithms fail to recognize objects. The relationship between the rotation and translation of an object in the 3D world and the changes in the coordinates of pixels in the 2D image plane is also not trivial. Algorithms dealing with variation in viewpoint usually either make assumptions about the change in feature space caused by relative 3D transformations or about the position and orientation of the camera, or require autocalibration to estimate the camera parameters [45, 43, 51, 81, 40, 42, 39, 73, 82, 50, 62, 74, 89, 90, 91, 7, 88, 96, 27, 41, 49] from the images in order to account for viewpoint and camera parameter changes. Learning viewpoint manifolds [55] [20] and latent spaces for viewpoints [106] [105] are two popular approaches to this problem, but they require simplifying assumptions in order to solve it. In this paper, a geometric approach is taken to address this problem and a solution for the most general case is provided.
We propose a template matching method based on image-domain relations in the projective space that can match objects across any pair of poses as long as the template image and the probe image have enough overlap for keypoint extraction. We prove that for one object seen by two cameras, with arbitrary intrinsic and extrinsic camera parameters, a restriction applies on the eigenvalues of the homography matrices associated with any quadruple of keypoint correspondences. By exploiting this constraint, an error function is introduced that is able to estimate how likely the provided reference and test images belong to the same object under different viewpoints.
The novelty of the paper can be summarized as follows:
- We propose a template matching method that can match the given template with any query image, even under a wide baseline and viewpoint changes, as long as the two images overlap.
- Unlike learning-based methods, the proposed approach does not need separate training data for each viewpoint. We also do not make any assumptions on the orientation of the cameras or their intrinsic matrices.
II Related Work
One common approach to tackle the variance in pose is to find latent spaces where the correlation between two views is maximized. Canonical correlation analysis (CCA) [106] projects the data from two views into two low-dimensional subspaces which are highly correlated. Sharma et al. [105] have extended the CCA method so that it exploits the labels of the training data to find a more discriminative projection direction. Both of these methods can exploit kernels to model non-linearity. Although methods based on latent spaces have proven to be powerful tools for both multi-view image classification and multi-modal data classification, they require learning a projection direction for every viewpoint, and their ability to generalize to unseen viewpoints is limited.
Another set of solutions tries to fit the given 2D images to predefined 3D shapes of objects (e.g. a face) from a single-view image [147][92]. In [17], the authors propose a 3D pose normalization for face recognition in order to make it robust to variation in pose. [104] exploits 3D CAD models to detect objects such as bikes and cars and to find their pose. The use of these methods is restricted to objects with available predetermined 3D models. A rather interesting solution was proposed by [52] that does not require 3D reconstruction of the face; instead, the authors use the cost of stereo matching as their error function. However, they assume that epipolar lines are horizontal, which does not hold true for object recognition in the general case.
Ideally, we seek view-invariant recognition algorithms that require little training data, preferably supporting one-shot learning (OSL); generalize to unseen viewpoints (GUV); work on objects without known 3D structure (3DFree); and are invariant to the internal camera parameters (IICP). Table I compares the various classes of algorithms described above in terms of these desired properties.
In this paper, we take a geometric approach to the problem of viewpoint variation. Our work is inspired by Shen et al. [126], who used homography constraints to recognize body pose transitions between two successive frames of two video cameras observing human actions. Although we are not dealing with video frames in this work, we show that the concept can also be extended to a pair of still images of a rigid object: instead of dealing with moving points in space viewed by two pairs of frames (4 images), we recognize a rigid object from two images. The key to this extension is to consider quadruples of points in each camera image, instead of triplets of points in two frames of each camera. The result is a rigid object recognition method that can handle unknown viewpoints and internal camera parameters.
III Proposed Method
Given a reference image and a query image, our goal is to determine whether or not they depict the same 3D object under two different viewpoints. First, point correspondences are extracted between the two images. Such correspondences can be obtained from any keypoint extraction and matching algorithm such as SIFT [95], SURF [31] or Harris [68]. For clarity, we use upper case letters for 3D coordinates and lower case letters for 2D coordinates on the image plane. We introduce an error function that, in the ideal case, vanishes when there exists a unique 3D configuration of points that maps to the extracted 2D keypoint correspondences. Conversely, the value of the error function increases if no such 3D configuration is possible. Furthermore, the proposed error function is fully projective (i.e. fully defined in the image domain) and hence is invariant to the camera positions and internal parameters.
Consider the object shown in Figure 1, which consists of four 3D points in general position. Two cameras (C1 and C2), located at two different positions, image this object. In the most general case, the two cameras are projective with 11 degrees of freedom each (i.e. different intrinsic parameters and arbitrary orientations in 3D space). Two key observations lead to the proposed solution: (1) any three of the quadruple of points define a plane in 3D space that induces a homography between the two cameras; (2) with a quadruple of points one can obtain 4 such planes, i.e. two pairs of homographies. Each pair of homographies plays a role similar to that of the moving plane considered in [126], except that in our case, instead of a single plane moving in time, we consider the dual case of two planes in a rigid body. Since this case is dual to the problem considered in [126], the construct remains the same. We illustrate this using the example of Figure 1.
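Let the two planes shown in orange and blue in Figure 1 correspond to two triplets of the four 3D points, and consider the corresponding triplets of image points in the two views. We assume that no three projected points are collinear in either view. Consider also the epipoles of the two images. Since epipoles are mapped across the two images by the homography induced by any plane in the scene, the homography induced by each plane maps the three image points of its triplet, together with the epipole, from the first image to their counterparts in the second image.
These four correspondences per plane yield a pair of homographies, one for each plane, through which we can define a single transformation that maps one image onto itself.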
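As an illustration of how one such plane-induced homography could be recovered from its four correspondences (three image points plus the epipole), here is a minimal sketch using the standard direct linear transform (DLT). The function name `homography_from_points` and the use of NumPy are assumptions made for this example; the paper does not state which estimation procedure it uses.

```python
import numpy as np


def homography_from_points(src, dst):
    """Estimate a 3x3 matrix H with dst ~ H @ src using the DLT.

    src, dst: arrays of shape (n, 3), n >= 4, holding homogeneous 2D points.
    H is only defined up to scale.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    rows = []
    for p, (u, v, t) in zip(src, dst):
        # Two independent rows of the constraint dst x (H @ src) = 0.
        rows.append(np.concatenate([np.zeros(3), -t * p, v * p]))
        rows.append(np.concatenate([t * p, np.zeros(3), -u * p]))
    A = np.stack(rows)
    # The stacked entries of H span the null space of A, which is given by
    # the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    return vt[-1].reshape(3, 3)
```

In the setting described above, `src` would hold the three image points of one triplet and the epipole of the first image, and `dst` their counterparts in the second image.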
Proposition 1: The transformation defined by this pair of homographies reduces to a homology if and only if the presumed point correspondences in the two images are images of the same 3D point configuration.
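For readers who prefer symbols, the relations described in words above can be sketched as follows. The notation ($\mathbf{x}_i$ and $\mathbf{x}'_i$ for corresponding image points, $\mathbf{e}$ and $\mathbf{e}'$ for the epipoles, $\mathbf{H}_1$ and $\mathbf{H}_2$ for the two plane-induced homographies), the particular choice of triplets, and the order of composition are all assumptions made for illustration and are not necessarily the paper's own.

```latex
\begin{align}
\mathbf{H}_1\,\mathbf{x}_i &\sim \mathbf{x}'_i \quad (i = 1, 2, 3), &
\mathbf{H}_1\,\mathbf{e} &\sim \mathbf{e}', \\
\mathbf{H}_2\,\mathbf{x}_j &\sim \mathbf{x}'_j \quad (j = 2, 3, 4), &
\mathbf{H}_2\,\mathbf{e} &\sim \mathbf{e}', \\
\mathbf{H} &\sim \mathbf{H}_2^{-1}\,\mathbf{H}_1, &
\mathbf{H}\,\mathbf{e} &\sim \mathbf{e}.
\end{align}
```

Under these assumptions, $\mathbf{H}$ maps the first image onto itself. It fixes the epipole $\mathbf{e}$ and every point on the image of the line where the two planes meet (here the line through $\mathbf{x}_2$ and $\mathbf{x}_3$). A transformation with a line of fixed points and one further fixed point is a planar homology, and its eigenvalues have the form $(\lambda, \lambda, \mu)$ up to a common scale.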
The immediate consequence of this observation is that two of the eigenvalues of the matrix of this transformation must be equal if the presumed point correspondences are images of the same 3D point configuration. This allows us to define a cost function that makes it possible to determine whether a set of image points and their matching correspondences from a template image originate from the same 3D object. Suppose we have a set of template images and we establish putative point correspondences between the query image and each reference template. One can then form quadruples of point correspondences, each yielding one such matrix for each template. For every matrix, consider its two closest eigenvalues. Finding the optimal matching template is then a labeling process: the query receives the label of the template for which the discrepancy between these eigenvalue pairs, accumulated over its quadruples, is smallest (equation 5).
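The eigenvalue test and the labeling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact error function: the composition order `inv(H2) @ H1`, the normalization of the eigenvalues to remove the arbitrary scale of a homography, the averaging over quadruples, and the names `eigen_gap` and `best_template` are all choices made for this example.

```python
import numpy as np


def eigen_gap(H1, H2):
    """Gap between the two closest eigenvalues of inv(H2) @ H1.

    The value is near zero when the composed matrix is a homology.
    """
    H = np.linalg.inv(H2) @ H1
    lam = np.linalg.eigvals(H)
    # Homographies are defined up to scale, so normalize the eigenvalues
    # (assumed normalization, chosen to make the gap scale invariant).
    lam = lam / np.linalg.norm(lam)
    gaps = [abs(lam[i] - lam[j]) for i, j in ((0, 1), (0, 2), (1, 2))]
    return float(min(gaps))


def best_template(pairs_per_template):
    """Pick the template whose homography pairs give the smallest mean gap.

    pairs_per_template: one list per template, each containing (H1, H2)
    tuples, one tuple per quadruple of correspondences.
    """
    scores = [
        float(np.mean([eigen_gap(H1, H2) for H1, H2 in pairs]))
        for pairs in pairs_per_template
    ]
    return int(np.argmin(scores)), scores
```

A matching template should give gaps close to zero for all of its quadruples, while a non-matching template should generally give clearly larger values.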
IV Experimental Results
In this section, we demonstrate the performance of the proposed method on both synthetic and real-world datasets, covering applications such as object and face recognition.
IV-A Synthetic Data
In order to understand the behavior of the error function in equation 5 in the presence of noise in keypoint localization, the projection of 3D points onto the image plane is simulated using the pinhole camera model. The point clouds used for generating the synthetic objects are obtained from the BigBIRD [132] dataset, which consists of RGBD images of objects sampled on the viewing hemisphere. The object ’Advil’ is chosen as the positive example and the object ’Syrup’ as the negative example. The error measure for the ’Advil-Advil’ pair is expected to be lower than that for the ’Advil-Syrup’ pair.
Two cameras are used to generate synthetic images on the image plane. The first camera, which is the reference camera, is fixed at the world origin and looks along the Z axis. The second camera, the test camera, moves on the viewing hemisphere, which is achieved by rotating the reference camera about two axes. Since the number of points in the cloud is over one thousand, we randomly choose 8 points as the keypoints and project only these 8 points onto the image plane. The focal lengths of both cameras vary randomly within a fixed range. Then, by adding Gaussian noise to the positions of the keypoints on the image plane, we measure the robustness of the algorithm.
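The simulation setup can be sketched roughly as follows. Everything specific in this example is an assumption made for illustration: the random stand-in for the point cloud, the focal length range of 500 to 1500, the choice of the X and Y axes, the rotation about the object centre, and the names `rotation_xy` and `project`.

```python
import numpy as np


def rotation_xy(angle_x, angle_y):
    """Rotation about the X axis followed by a rotation about the Y axis."""
    cx, sx = np.cos(angle_x), np.sin(angle_x)
    cy, sy = np.cos(angle_y), np.sin(angle_y)
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    return Ry @ Rx


def project(points, f, R=None, t=None, sigma=0.0, rng=None):
    """Pinhole projection x ~ K (R X + t) with optional Gaussian pixel noise."""
    R = np.eye(3) if R is None else R
    t = np.zeros(3) if t is None else t
    K = np.diag([f, f, 1.0])
    cam = points @ R.T + t          # points in the camera frame
    uvw = cam @ K.T                 # homogeneous image coordinates
    uv = uvw[:, :2] / uvw[:, 2:3]
    if sigma > 0:
        rng = np.random.default_rng() if rng is None else rng
        uv = uv + rng.normal(0.0, sigma, size=uv.shape)
    return uv


rng = np.random.default_rng(0)
centre = np.array([0.0, 0.0, 5.0])
cloud = rng.uniform(-1.0, 1.0, size=(1000, 3)) + centre  # stand-in point cloud
keypoints = cloud[rng.choice(len(cloud), size=8, replace=False)]

f_ref, f_test = rng.uniform(500.0, 1500.0, size=2)  # assumed focal range
ref_pts = project(keypoints, f_ref)

# Move the test camera around the object by rotating about the object centre.
R = rotation_xy(np.deg2rad(10.0), np.deg2rad(20.0))
test_pts = project(keypoints, f_test, R=R, t=centre - R @ centre,
                   sigma=1.0, rng=rng)
```

Sweeping the two angles and the noise level `sigma` would produce the kind of error surfaces discussed in this section.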
In Figure 2, the matching scores for different viewing angles are plotted for both the matching query-template pair (the lower surface) and the non-matching query-template pair (the upper surface). It can be observed that the error is almost zero for the matching pair, while it is high for the non-matching pair. To find out the extent of separation between the two (i.e. the ability to distinguish between a correct and an incorrect match), we added Gaussian noise to the positions of the keypoints in the image planes. As the noise variance increases, the two error surfaces get closer and the distinction between a true match and a false match becomes harder. Our experiments show that the method tolerates noise up to a strength that roughly equates to the correspondences being 24 pixels off.
IV-B Real-World Data
We also tested the proposed method on real datasets, including a 3D multi-view object recognition dataset and two multi-view face recognition datasets. The first dataset is Coil-20 [102], which consists of 20 classes, each imaged with the object rotated in 5-degree steps on a turntable. Pointing04 [66] and UMIST [67] are the two multi-view face datasets used to evaluate the proposed method. UMIST consists of 575 faces from 20 different persons taken under different conditions. Pointing04 contains 2690 face photos taken from 15 people. The Pointing04 face rotation has more degrees of freedom, and images are taken with and without glasses. Keypoints are extracted using the popular SIFT [95] descriptor and matched using a nearest neighbor method. Note that, unlike many methods in the literature, the proposed algorithm does not assume that the coordinates of facial keypoints such as the nose and lips are given; it relies only on the features extracted by SIFT, which may lie anywhere on the face.
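The nearest neighbor matching step mentioned above can be sketched as follows. This is a minimal illustration assuming descriptors are already available as rows of NumPy arrays and that Euclidean distance is used; the function name `nearest_neighbor_matches` is hypothetical, and the paper does not specify further details such as filtering of ambiguous matches.

```python
import numpy as np


def nearest_neighbor_matches(desc_ref, desc_query):
    """Match each query descriptor to its closest reference descriptor.

    Returns a list of (ref_index, query_index) pairs, one per query row.
    """
    desc_ref = np.asarray(desc_ref, dtype=float)
    desc_query = np.asarray(desc_query, dtype=float)
    # Squared Euclidean distance between every query/reference pair.
    diff = desc_query[:, None, :] - desc_ref[None, :, :]
    dist2 = np.sum(diff ** 2, axis=2)
    nearest = np.argmin(dist2, axis=1)
    return [(int(r), q) for q, r in enumerate(nearest)]
```

The resulting index pairs identify the putative keypoint correspondences from which quadruples are then drawn.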
Although in theory our method needs only one template per class to match two images, there has to be enough overlap between the query and reference images for the keypoint extractor to find enough mutual keypoints in both. Therefore, in each dataset, 8 images per class are chosen as templates and the rest are used as query images. For instance, in the Coil-20 dataset, 8 images are taken as templates and 64 are used for the test phase. The overall accuracies for all datasets are provided in Table II.
V Conclusion
In this paper, a new view-invariant template-matching method is introduced that imposes no restrictions on external or internal camera parameters. The robustness of the algorithm has been tested by adding Gaussian noise to the coordinates of the keypoints on the image to simulate errors in keypoint localization. Finally, the accuracy of the method on object and face recognition was tested, producing remarkably good results.
References
- [1] Muhamad Ali and Hassan Foroosh. Natural scene character recognition without dependency on specific features. In Proc. International Conference on Computer Vision Theory and Applications, 2015.
- [2] Muhamad Ali and Hassan Foroosh. A holistic method to recognize characters in natural scenes. In Proc. International Conference on Computer Vision Theory and Applications, 2016.
- [3] Muhammad Ali and Hassan Foroosh. Character recognition in natural scene images using rank-1 tensor decomposition. In Proc. of International Conference on Image Processing (ICIP), pages 2891–2895, 2016.
- [4] Mais Alnasser and Hassan Foroosh. Image-based rendering of synthetic diffuse objects in natural scenes. In Proc. IAPR Int. Conference on Pattern Recognition, volume 4, pages 787–790, 2006.
- [5] Mais Alnasser and Hassan Foroosh. Rendering synthetic objects in natural scenes. In Proc. of IEEE International Conference on Image Processing (ICIP), pages 493–496, 2006.
- [6] Mais Alnasser and Hassan Foroosh. Phase shifting for non-separable 2d haar wavelets. IEEE Transactions on Image Processing, 16:1061–1068, 2008.
- [7] Nazim Ashraf and Hassan Foroosh. Robust auto-calibration of a ptz camera with non-overlapping fov. In Proc. International Conference on Pattern Recognition (ICPR), 2008.
- [8] Nazim Ashraf and Hassan Foroosh. Human action recognition in video data using invariant characteristic vectors. In Proc. of IEEE Int. Conf. on Image Processing (ICIP), pages 1385–1388, 2012.
