Rigid Body Structure and Motion From Two-Frame Point-Correspondences Under Perspective Projection
Mieczysław A. Kłopotek

Institute of Computer Science, Polish Academy of Sciences
PL 01-248 Warsaw, 5 Jana Kazimierza St.
**Abstract. This paper is concerned with the possibility of recovering motion and structure parameters from multiframes under perspective projection when only points on a rigid body are traced. A free (unrestricted and uncontrolled) pattern of motion between frames is assumed. The major question is how many points and/or how many frames are necessary for the task. It has been shown in an earlier paper [6] that for orthogonal projection two frames are insufficient for the task. This paper demonstrates that, under perspective projection, total uncertainty about the relative position of the focal point versus the projection plane makes the recovery of structure and motion from two frames impossible.**
1 Introduction
Recovery of a three-dimensional structure from a single view of even the simplest scene consisting of a single object has been viewed as heavily underconstrained [8]. On the other hand, the availability of multiframes may provide additional constraints which may make the problem solvable [8]. Therefore, the problem of recovery of rigid bodies from multiframes has been studied in the past. For example, in the domain of orthogonal projections, Lee [8] was concerned with rigid bodies consisting of two traceable points rotating around a fixed direction, Klopotek [4] studied rigid bodies consisting of three traceable points subject to free (unrestricted) motion, and Klopotek [7] investigated rigid bodies consisting of two traceable points connected by a smooth 3D curve subject to free (unrestricted) motion. In the domain of perspective projections, Weng [13] deals with the recovery of motion and structure of rigid bodies consisting only of straight lines (13 of them). Roach and Aggarwal [10] [11] studied motion and structure recovery by tracing points under perspective projection, assuming a static scene and a moving camera. They showed that five points in two views are needed to recover the structure and motion parameters. Their solution involved a system of 18 highly nonlinear equations. Nagel [9] proposed a simplified equation system by separating the solution for the translation vector from that for the rotation matrix, the rotation matrix being determined by a system of 3 equations in three motion parameters. Wang et al. [12] studied bodies consisting of four points and a line. Azarbayejani and Pentland [1] reviewed and studied problems of structure and motion recovery under unknown (but fixed) focal length.
Generally, much effort has been devoted to reducing the number of frames and traceable features (points, lines). This is understandable since, on the one hand, this reduces the effort required for tracing features and, on the other hand, more features are left for validation and/or for improving error resistance.
It is generally known that the amount of information provided by the frames must at least balance the number of degrees of freedom involved. If this balance of degrees of freedom and information is not achieved, then results concerning structure and motion may be totally ambiguous. This may be extremely difficult to notice when performing numerical computations since, especially for perspective projection, most methods of recovery of structure and motion involve complex nonlinear multivariate equation systems which may yield a unique solution (due to numerical round-off or imprecise observations) even if no such solution exists. This was demonstrated, e.g., in [5] for the four-points-and-a-line algorithm from [12] for two frames under perspective projection (the number of degrees of freedom exceeding the information available).
On the other hand, it appears that the information available from frames may be divided into two categories: new information and redundant information [6]. If the balance of degrees of freedom and information is achieved, but the balance of degrees of freedom and new information is not, then results concerning structure and motion may be partially ambiguous, as demonstrated in [6] for the four-point problem for two frames under orthogonal projection (no increase in new information over three points due to a shift towards redundant information).
This paper demonstrates that perspective projections are also prone to the emergence of redundant information. In section 2 we recall the problem of the emergence of information redundancy as observed for orthogonal projection. Then in section 3 we describe the situation for perspective projection in which information redundancy emerges. In section 4 we briefly discuss how to prevent this information split and how to make use of it.
The paper ends with a brief discussion and some concluding remarks.
2 Split of Information Under Orthogonal Projection
2.1 Degrees of Freedom for Orthogonal Projection
Under orthogonal projection, each point of the body introduces 3 df in the first frame, minus one df for the whole body, as there is no possibility of determining the initial depth of the body in space. The motion introduces only 5 df for each subsequent frame (three for rotations and two for translation), because the motion in the direction orthogonal to the projection plane has no impact on the image. In general, with p points forming the rigid body traced over k frames, we have $3p - 1 + 5(k-1)$ degrees of freedom. On the other hand, within each image each traced point provides us with two pieces of information: its x and its y position within the frame. Hence we have at most $2pk$ pieces of information available from k images. Thus we need at least the balance
$$3p - 1 + 5(k-1) \le 2pk \tag{1}$$
to achieve recoverability.
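The counting above can be checked mechanically. The following Python sketch simply encodes the balance as stated here, 3p - 1 + 5(k - 1) degrees of freedom against 2pk observed coordinates; the function names are illustrative, and the sketch says nothing about the redundancy discussed next.

```python
# Degree-of-freedom balance for orthogonal projection, as counted in Section 2.1.
# Function names are illustrative only.


def dof_orthogonal(p: int, k: int) -> int:
    """3 df per point, minus 1 for the unrecoverable depth of the whole body,
    plus 5 df (3 rotations, 2 in-plane translations) per further frame."""
    return 3 * p - 1 + 5 * (k - 1)


def info(p: int, k: int) -> int:
    """Each traced point yields its x and y coordinate in every frame."""
    return 2 * p * k


def balanced(p: int, k: int) -> bool:
    """True if the counted information at least matches the degrees of freedom."""
    return info(p, k) >= dof_orthogonal(p, k)


def min_frames(p: int, k_max: int = 100):
    """Smallest number of frames meeting the balance for p points, or None."""
    for k in range(1, k_max + 1):
        if balanced(p, k):
            return k
    return None


for p in (3, 4, 5):
    print(f"p={p}: at least {min_frames(p)} frames by counting alone")
```

For p = 4 and k = 2 both counts equal 16, so counting alone suggests that two frames could suffice; Section 2.2 explains why the fourth point nevertheless contributes only one new piece of information in the second frame.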
2.2 Information Redundancy
As we can derive from the above equations, if the number of traced points equals 4, then the amount of information may be sufficient to recover structure and motion from two frames (for p = 4 and k = 2 there are 16 degrees of freedom and 16 pieces of information). However, let us assume that we managed to match two frames with a 3D object consisting of 4 or more points, that is, we constructed an object in space and found positions of two projection planes such that the projections of the object onto the two planes give the two observed frames. Then, as shown in [6], we can rotate one of the frames around a specially selected axis by any angle and obtain yet another 3D object that also matches the two frames.
This implies that the fourth and subsequent points carry only one piece of new information in the second image instead of two. Hence complete recovery of the 3D structure from two images is impossible.
At this point it is natural to ask what happens to the one piece of information left unused. As shown in [6], it can be exploited for solving the problem of correctly assigning the identities of points in two consecutive frames.
3 Information Redundancy Preventing Structure and Motion Recovery from Two Frames under Perspective Projection
In most papers concerning recovery of structure and motion from multiframes, detailed knowledge of the geometry of the camera's optical system is assumed, that is, the precise position of the projection center with respect to the projection plane is known. However, this need not always be the case. If we take images, e.g., from a household video camera (even a fine one with autofocusing), then the relative position of the projection center and the image plane is not only unknown but also varying over time. If only images from a photographic camera of unknown type are available, then the information on the relative position of the image plane and the focal point is also inaccessible. Below we demonstrate that under these circumstances it is impossible to recover structure and motion from two frames, whatever number of traceable points is taken. First we demonstrate that the overall number of degrees of freedom is not the obstacle. Then we show that informational redundancy occurs, consuming the information necessary for recovery of structure and motion.
3.1 Degrees of Freedom
Let us now consider the degrees of freedom for the perspective projection if we assume that the relative position (in space) of the focal point with respect to the projection plane is not known and may vary over time.
Each point of the body introduces 3 df in the first frame, minus one df for the whole body, as there is no possibility of determining the scaling of the whole body. Additionally, we have 3 df due to the uncertainty of the location of the focal point. The motion introduces 9 df for each subsequent frame (three for rotations and three for translation of the projection plane, plus three for translation of the focal point). In general, with p points forming the rigid body traced over k frames, we then have $3p - 1 + 3 + 9(k-1) = 3p + 2 + 9(k-1)$ degrees of freedom. On the other hand, within each image each traced point provides us with two pieces of information: its x and its y position within the frame. Hence we have at most $2pk$ pieces of information available from k images. Thus we need at least the balance
$$3p + 2 + 9(k-1) \le 2pk \tag{2}$$
to achieve recoverability. Let us consider some combinations of parameters:
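- for k=2 frames, p=10 points we get $3 \cdot 10 + 2 + 9 \cdot 1 = 41 > 40 = 2 \cdot 10 \cdot 2$
- for k=2 frames, p=11 points we get $3 \cdot 11 + 2 + 9 \cdot 1 = 44 \le 44 = 2 \cdot 11 \cdot 2$
- for k=2 frames, p=7 points we get $3 \cdot 7 + 2 + 9 \cdot 1 = 32 > 28 = 2 \cdot 7 \cdot 2$
- for k=3 frames, p=7 points we get $3 \cdot 7 + 2 + 9 \cdot 2 = 41 \le 42 = 2 \cdot 7 \cdot 3$
- for k=3 frames, p=6 points we get $3 \cdot 6 + 2 + 9 \cdot 2 = 38 > 36 = 2 \cdot 6 \cdot 3$
- for k=4 frames, p=6 points we get $3 \cdot 6 + 2 + 9 \cdot 3 = 47 \le 48 = 2 \cdot 6 \cdot 4$
- for k=8 frames, p=5 points we get $3 \cdot 5 + 2 + 9 \cdot 7 = 80 \le 80 = 2 \cdot 5 \cdot 8$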
The above (in)equalities tell us that, to recover structure and motion from 5 traceable points, we would need 8 images (frames); with 7 traceable points we would need 3 frames; and to recover from two frames we would need 11 points, if we consider only the balance of degrees of freedom and the amount of information.
If we have only p=4 traceable points, then the number of degrees of freedom equals $-1 + 3 + 3 \cdot 4 + 9(k-1) = 9k + 5$, whereas the amount of information equals $2 \cdot 4 \cdot k = 8k$, which is always less than the number of degrees of freedom. This means that if we trace only four points, we can never recover structure and motion, whatever number of frames is available.
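The combinations listed above can be reproduced with a short Python sketch that encodes the count 3p + 2 + 9(k - 1) against 2pk and searches for the smallest number of frames or points meeting the balance. The function names are illustrative; as with any such count, meeting the balance is only a necessary condition.

```python
# Degree-of-freedom balance for perspective projection with a focal point whose
# position relative to the projection plane is unknown and may change per frame
# (Section 3.1). Names are illustrative only.


def dof_perspective(p: int, k: int) -> int:
    """3 df per point, -1 for the unrecoverable scale, +3 for the unknown focal
    point, +9 per further frame (3 rotations, 3 plane translations, 3 focal)."""
    return 3 * p - 1 + 3 + 9 * (k - 1)


def info(p: int, k: int) -> int:
    """x and y of every traced point in every frame."""
    return 2 * p * k


def min_frames(p: int, k_max: int = 1000):
    """Smallest k with info >= dof for p points, or None if none up to k_max."""
    return next((k for k in range(1, k_max + 1)
                 if info(p, k) >= dof_perspective(p, k)), None)


def min_points(k: int, p_max: int = 1000):
    """Smallest p with info >= dof for k frames, or None if none up to p_max."""
    return next((p for p in range(1, p_max + 1)
                 if info(p, k) >= dof_perspective(p, k)), None)


for p in (4, 5, 6, 7, 11):
    print(f"p={p}: frames needed by counting = {min_frames(p)}")
```

It reproduces the figures of this subsection: 8 frames for 5 points, 4 for 6, 3 for 7, 2 for 11, and none at all for 4 points. Section 3.2 explains why the two-frame figure of 11 points cannot actually be exploited.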
3.2 Emerging Information Redundancy
We will demonstrate, however, that it is impossible to recover structure and motion from two frames alone, because the information stemming from points beyond the first seven is redundant.
Let us consider a rigid body consisting of seven traceable points P, Q, R, A, C, E and G. We shall assume that no four of them are coplanar. (For the treatment of four coplanar points, compare [7, 3].) Let us assume that their two projections are available: frame 1 with P’, Q’, R’, A’, C’, E’ and G’ (see Fig.1), and frame 2 with P”, Q”, R”, A”, C”, E” and G” (see Fig.2). What we claim now is that if we have another point Z with its projection Z’ in the first frame, then we can draw a line in the second frame on which the projection Z” of Z onto the second frame must lie. In other words, given Z’, the point Z” has only one intrinsic degree of freedom within the second frame. This also means that knowledge of the location of Z” contributes only one piece of information to the recovery of structure and motion. As each new point introduces 3 df and provides 2 pieces of information in the first frame but only one in the second (3 pieces in all, merely balancing its own 3 df), the addition of any further point contributes nothing to the solution of the structure and motion problem.
To demonstrate the validity of our claim, let us imagine that it is not the traced body that moves but rather the projection plane and the focal point. Consider the intrinsic position of the focal point of the first frame and that of the focal point of the second frame (see Fig.3). Under this convention, we consider the projection of the focal point of the first frame onto the second frame. Let us define the four straight lines joining the focal point of the first frame with A, C, E and G, respectively (the projection rays of these points in the first frame). Let us consider the plane PQR. Let B, D, F and H be the points of intersection of these lines with the plane PQR, respectively. Let B”, D”, F” and H” be the projections of B, D, F and H onto the second frame (with respect to its focal point). Let Z be the eighth point of the rigid body; in an analogous way we define its projection ray in the first frame, its projections Z’ and Z” onto frames 1 and 2, the intersection point of this ray with the plane PQR, and the projection of that intersection point onto frame 2.
To show the validity of our claim, we first demonstrate that, given P’, Q’, R’, A’, C’, E’, G’ in the first frame and P”, Q”, R”, A”, C”, E”, G” in the second frame, we can identify the projection of the first frame's focal point in the second frame.
Then we show that, given additionally Z’ in the first frame, we can identify in the second frame the projection of the intersection point of Z's projection ray with the plane PQR. But then we have clearly identified the line through these two points in the second frame, which will complete the proof.
3.2.1 Basic Geometrical Facts
Let us first recall the well-known theorem on the double quotient (DQ, also known as the cross ratio), which states the following (see Fig.5): if points A, B, C, D are collinear and A’, B’, C’, D’ are their perspective projections onto a plane (perspective projection preserves collinearity), then the following holds:
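$$\frac{AC}{BC} : \frac{AD}{BD} = \frac{A'C'}{B'C'} : \frac{A'D'}{B'D'} \tag{3}$$
where lengths are signed along the respective lines.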
This actually means the following for the operation of perspective projection onto a frame fr with a focal point F: given three collinear points A, B, C, their (perspective) projections A’, B’, C’ onto a plane, and a fourth point D on the line AB, we can identify the position of the projection D’ of D on the line A’B’, even if we know neither the position of F nor that of the frame fr in space with respect to the points A, B, C.
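The double-quotient statement can be illustrated numerically. The following Python sketch (using NumPy; the point coordinates, focal point and frame are made up for illustration, and the helper names are hypothetical) computes the double quotient of four collinear points in space, projects them from an arbitrary focal point onto a tilted frame, and places D’ on the line A’B’ using only A’, B’, C’ and the double quotient.

```python
import numpy as np


def project(point, focal, plane_point, plane_normal):
    """Central projection of a 3D point from `focal` onto the given plane."""
    d = point - focal
    t = np.dot(plane_point - focal, plane_normal) / np.dot(d, plane_normal)
    return focal + t * d


def line_params(points, origin, direction):
    """Signed positions t of collinear points on the line origin + t*direction."""
    return [float(np.dot(q - origin, direction) / np.dot(direction, direction))
            for q in points]


def double_quotient(ta, tb, tc, td):
    """(AC/BC) : (AD/BD) with signed lengths, i.e. (AC*BD) / (BC*AD)."""
    return ((tc - ta) * (td - tb)) / ((tc - tb) * (td - ta))


def locate_fourth(a_img, b_img, c_img, dq):
    """Place D' on the line A'B' from the images A', B', C' and the double
    quotient of A, B, C, D in space; neither focal point nor frame pose is used."""
    u = b_img - a_img
    (tc,) = line_params([c_img], a_img, u)   # A' sits at t=0, B' at t=1
    td = tc / (tc - dq * (tc - 1.0))          # solve the DQ equation for td
    return a_img + td * u


# Illustrative data: four collinear points and an arbitrary tilted frame.
A = np.array([0.0, 0.0, 4.0])
direction = np.array([1.0, 1.0, 1.0])
points = [A + t * direction for t in (0.0, 1.0, 2.0, 3.5)]   # A, B, C, D
dq = double_quotient(*line_params(points, A, direction))      # equals 10/7

focal = np.array([0.3, -0.2, 0.0])
frame = (np.array([0.0, 0.0, 1.0]), np.array([0.1, 0.2, 1.0]))
images = [project(q, focal, *frame) for q in points]
d_img = locate_fourth(images[0], images[1], images[2], dq)
print(np.allclose(d_img, images[3]))
```

The located D’ coincides with the true projection of D although `locate_fourth` never sees the focal point or the frame.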
What is more, given four coplanar points A, B, C, D (no three collinear) together with their projections A’, B’, C’, D’, and given a point Z in the plane ABC, we can uniquely determine the position Z’ of the projection of Z onto the plane A’B’C’, even if we know neither the position of the focal point F nor that of the frame fr in space with respect to the points A, B, C, D. This is straightforward to achieve via equation (3). Let denote the intersection of lines DB and BC, let denote the intersection of lines DC and AB, let denote the intersection of lines ZB and BC, let denote the intersection of lines ZC and AB (see Fig.4). Also, let denote the intersection of lines D’B’ and B’C’, let denote the intersection of lines D’C’ and A’B’. Obviously, is the projection of , and is the projection of .
Obviously, are collinear, and are collinear, and are collinear and are also collinear. Hence , the projections of resp. are easily located. Now is easily located as the intersection of lines and .
Note that the double quotients and may be considered as “coordinates” of Z in the ABCD coordinate system, preserved under any sequence of perspective projections.
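For four coplanar points this construction amounts to transferring Z through the projective map (homography) between the object plane and the image that is fixed by the four correspondences. The sketch below swaps in the standard direct linear transform (DLT) for the paper's line-intersection construction; the object plane, focal point and coordinates are hypothetical. It fits the map from A, B, C, D alone and checks that it predicts the true image of Z.

```python
import numpy as np


def homography_from_4(src, dst):
    """Direct linear transform: 3x3 H with dst ~ H @ (x, y, 1) from four
    point pairs, no three of the source points collinear."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1.0, 0.0, 0.0, 0.0, u * x, u * y, u])
        rows.append([0.0, 0.0, 0.0, -x, -y, -1.0, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.array(rows))
    return vt[-1].reshape(3, 3)   # null vector of the 8x9 system


def apply_h(h, xy):
    """Map a 2D point through H and dehomogenize."""
    p = h @ np.array([xy[0], xy[1], 1.0])
    return p[:2] / p[2]


# Illustrative scene: object plane z = 5 + 0.5x + 0.2y parameterized by (x, y),
# imaged by central projection from FOCAL onto the image plane z = 1.
FOCAL = np.array([0.2, 0.1, 0.0])


def on_plane(x, y):
    return np.array([x, y, 5.0 + 0.5 * x + 0.2 * y])


def image_of(point):
    t = (1.0 - FOCAL[2]) / (point[2] - FOCAL[2])
    return (FOCAL + t * (point - FOCAL))[:2]


abcd = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.5)]   # A, B, C, D in the plane
z = (1.2, 0.7)                                          # Z in the same plane
h = homography_from_4(abcd, [image_of(on_plane(*c)) for c in abcd])
z_transferred = apply_h(h, z)
z_true = image_of(on_plane(*z))
print(np.allclose(z_transferred, z_true))
```

Any algorithm that determines the plane-to-image map from four correspondences would serve here; the DLT is chosen only because it fits in a few lines.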
Furthermore, if three points A’, B’, C’ (projections of some points A, B, C) are not collinear, then there always exists a series of projections with respect to suitably chosen focal points and projection planes such that in the last frame, with , , being images of , is orthogonal to and line segments and are of unit length.
Last but not least, if points A, B, C, D are coplanar, then lines AB and CD are either parallel or they intersect.
These well-known facts from elementary geometry prove very fruitful when applied to our task. Let us turn back to the situation depicted in Fig.3.
3.2.2 Locating Projected Focal Point
Let us now identify the position of the projection, onto frame 2, of the focal point of frame 1. We know only the relative positions of the points P’, Q’, R’, A’, C’, E’, G’ in frame 1 and of P”, Q”, R”, A”, C”, E”, G” in frame 2. We assume that frame 2 has already been transformed, by a sequence of perspective projections, in such a way that R”, Q” and P” establish the coordinate system: two of the line segments joining them are orthogonal and of unit length, their common endpoint being the origin and the segments lying along the X-axis and the Y-axis. As the sequence of projections from the original frame 2 to the transformed one is known and preserves double quotients, we lose no information and can always locate the projected focal point in the original frame 2.
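One simple way to realize such a normalization, assumed here for illustration instead of an explicit sequence of perspective projections, is a single affine map sending R” to the origin and (hypothetically) Q” to (1, 0) and P” to (0, 1). Affine maps preserve ratios of lengths along a line and hence double quotients, and the map is invertible, so a point located in the normalized frame can be carried back to the original frame 2. The image coordinates below are made up.

```python
import numpy as np


def normalizing_map(r, q, p):
    """Affine map sending r to (0, 0), q to (1, 0) and p to (0, 1); also
    returns its inverse. r, q, p must not be collinear."""
    basis = np.column_stack([q - r, p - r])
    inv = np.linalg.inv(basis)

    def forward(x):
        return inv @ (np.asarray(x, dtype=float) - r)

    def backward(y):
        return basis @ np.asarray(y, dtype=float) + r

    return forward, backward


def dq_of(points):
    """Double quotient (AC*BD)/(BC*AD) of four collinear 2D points A, B, C, D."""
    a, b, c, d = (np.asarray(v, dtype=float) for v in points)
    u = b - a
    ta, tb, tc, td = (float(np.dot(v - a, u) / np.dot(u, u)) for v in (a, b, c, d))
    return ((tc - ta) * (td - tb)) / ((tc - tb) * (td - ta))


# Made-up image coordinates of R", Q", P" in the original frame 2.
r2, q2, p2 = np.array([3.0, 1.0]), np.array([4.5, 1.4]), np.array([2.6, 2.9])
forward, backward = normalizing_map(r2, q2, p2)

line = [np.array([0.5, 0.2]) + t * np.array([0.3, 0.7]) for t in (0.0, 1.0, 2.0, 3.5)]
print(dq_of(line), dq_of([forward(v) for v in line]))
```

Both printed double quotients agree, and `backward` carries any point found in the normalized frame back to the original one.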
Given the information, it is easy to locate points: intersection of and , intersection of and , intersection of and , intersection of and , intersection of and , intersection of and , intersection of and , intersection of and .
We can also calculate the double quotients:
[TABLE]
What are the constraints on the positions of in the frame 2 ? First of all, all the lines , , , must intersect at a single point that is at (because by definition , , , intersect at a single point that is at ). This leads us to the following equation system:
[TABLE]
We can solve the three linear equation systems (5), (6), (7) for , and from comparison of from the first two equation systems we get:
[TABLE]
Let us introduce ”coordinate” points of as follows: intersection of and , intersection of and , intersection of and , intersection of and , intersection of and , intersection of and , intersection of and , intersection of and ,
We see easily that (see Fig.6)
[TABLE]
and so forth for other auxiliary points , and .
Observe that, due to our assumption of R”,Q”,P” establishing the coordinate system, we have also (see equation (4)):
[TABLE]
Substitution of equations (9) and then (10) into equation (8) results in a polynomial equation in only two unknowns: and .
Transforming analogously (5) and (7) by first eliminating , we finally get another, independent equation in the same two unknowns: and .
Note that both are cubic in (and also in ). By multiplying each equation by the factor standing in front of in the other equation and then subtracting them, we get a quadratic equation in , easily solved symbolically. It is easily observed that one of its solutions is always the degenerate one (meaning a collapse of B”, D”, F”, H” and onto P”), so we always take the other one (thus having a unique solution at this stage).
Then we substitute this symbolic result into one of the two (”cubic”) equations substituting for and getting a one variable polynomial equation in , solvable by conventional methods.
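The elimination step can be illustrated generically. In the following sketch, x denotes the unknown in which both equations are cubic and y the remaining unknown; these placeholder names and the coefficients $a_i(y)$, $b_i(y)$ are ours, not the paper's notation.
$$f = a_3(y)\,x^3 + a_2(y)\,x^2 + a_1(y)\,x + a_0(y) = 0, \qquad g = b_3(y)\,x^3 + b_2(y)\,x^2 + b_1(y)\,x + b_0(y) = 0$$
Multiplying each equation by the leading coefficient of the other and subtracting cancels the cubic term:
$$b_3 f - a_3 g = \alpha(y)\,x^2 + \beta(y)\,x + \gamma(y) = 0, \qquad \alpha = b_3 a_2 - a_3 b_2,\quad \beta = b_3 a_1 - a_3 b_1,\quad \gamma = b_3 a_0 - a_3 b_0,$$
so that
$$x(y) = \frac{-\beta(y) \pm \sqrt{\beta(y)^2 - 4\,\alpha(y)\,\gamma(y)}}{2\,\alpha(y)}.$$
After the degenerate branch is discarded, substituting the remaining branch into $f = 0$ leaves an equation in y alone; in general, clearing the square root (or the denominator, if the retained root is rational) turns it into a polynomial equation in y.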
In a simulated experimental setting we had observations:
[TABLE]
[TABLE]
We got one of solutions =1.43 and ,
The behavior of the final polynomial in was as follows:
[TABLE]
3.2.3 Locating line z’
Given now the position of the point Z’ (the projection of Z) in the first frame, we can calculate the proper double quotients in frame 1, find the projection onto frame 2 of the intersection point of the projection ray of Z with the plane PQR, and then draw the line connecting this point with the projected focal point of frame 1, which is just the line we were looking for. Q.E.D.
3.2.4 Freedom of Shape of the Identified ”Rigid” Multipoint Object
The results of the previous subsection mean that, no matter how many points are given in two images, it is always possible to find countless fittings of the frames yielding different objects (differing not only in size but also in shape!) that may be the source of both projections. As we stated earlier, there are four degrees of freedom unusable by any fitting procedure. Below we show the effects of only two of these degrees of freedom, because visualizing the other two is far more complicated.
Assume that we fit together two frames with n point correspondences each, e.g. A’ and A” are projections of a point A in space (see Fig.7). Consider the line joining the two focal points (in 3D space). Obviously, the two focal points, A’ and A” are coplanar with this line, and A is the intersection of the projection rays through A’ and A”. Now let us “move” a focal point along this line to another location (the projection frame 1 is left as it was in space). Obviously, the two focal points, A’ and A” are still coplanar, and the two projection rays (most probably) intersect at a new point. The same happens with the other (n-1) traced points: after the move, the “rays” starting at the two focal points and passing through the traced points in frame 1 and frame 2, respectively, still intersect, but at different points. That is, another 3D object may also have given the same two projections. And this new object usually differs considerably from the previous one.
Very same manipulations can be done ”moving” along , giving other different 3D objects.
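The geometric argument can be checked numerically. The sketch below is an illustration under made-up assumptions (focal points, image planes and body points are hypothetical): it keeps both image planes fixed, moves the focal point of frame 2 along the line joining the two focal points (the construction is symmetric in the two frames), and re-intersects the projection rays.

```python
import numpy as np


def project(point, focal, plane_point, plane_normal):
    """Central projection of `point` from `focal` onto a plane."""
    d = point - focal
    t = np.dot(plane_point - focal, plane_normal) / np.dot(d, plane_normal)
    return focal + t * d


def intersect_rays(p1, d1, p2, d2):
    """Least-squares closest points of p1 + s*d1 and p2 + u*d2.
    Returns (midpoint, gap); gap ~ 0 means the rays really intersect."""
    a = np.column_stack([d1, -d2])
    (s, u), *_ = np.linalg.lstsq(a, p2 - p1, rcond=None)
    q1, q2 = p1 + s * d1, p2 + u * d2
    return (q1 + q2) / 2.0, float(np.linalg.norm(q1 - q2))


def pairwise_distances(points):
    pts = np.asarray(points)
    i, j = np.triu_indices(len(pts), k=1)
    return np.linalg.norm(pts[i] - pts[j], axis=1)


# Hypothetical scene: two focal points, two fixed image planes, five body points.
F1, F2 = np.zeros(3), np.array([2.0, 0.0, 0.0])
frame1 = (np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 1.0]))
frame2 = (np.array([2.0, 0.0, 1.0]), np.array([0.3, 0.0, 1.0]))
body = np.array([[0.0, 0.0, 5.0], [1.0, 0.5, 6.0], [-1.0, 1.0, 4.5],
                 [0.5, -1.0, 5.5], [-0.5, -0.5, 7.0]])
img1 = [project(P, F1, *frame1) for P in body]
img2 = [project(P, F2, *frame2) for P in body]

# Move the focal point of frame 2 along the line F1F2; both frames stay put.
F2_moved = F1 + 1.5 * (F2 - F1)
results = [intersect_rays(F1, a1 - F1, F2_moved, a2 - F2_moved)
           for a1, a2 in zip(img1, img2)]
new_body = np.array([x for x, _ in results])
gaps = [g for _, g in results]
ratios = pairwise_distances(new_body) / pairwise_distances(body)
print(max(gaps), ratios.round(3))
```

The gaps stay at round-off level, so the moved rays still meet, while the ratios of corresponding pairwise distances differ from pair to pair: the new point set is not merely a scaled copy of the original body.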
4 Exploiting and Preventing Information Redundancy
As in the case of orthogonal projection, we may now ask what happens to the one piece of information of the eighth point that is left unused. Let us consider what this information means geometrically. Given the first seven points, for each further point, if we know its image in the first frame, we can identify the line on which its image lies in the second frame. This means that a point Z with its image Z’ in the first frame must have its image lying on a concrete line z” in the second frame. But what if Z” does not lie on the prespecified line? Then two things may have happened: either Z is not part of a rigid body containing P, Q, R, A, C, E, G, or the identities of P”, Q”, etc. have been assigned incorrectly.
But the latter means that if we have a set of projected points S1 and a set of projected points S2 of which we know that they are projections of a set of points belonging to a rigid body, but the identities are not ascribed, then we may be capable of assigning identity relations between points of the set S1 and the set S2. For this purpose we may select eight points from the set S1 and try allocating points of the set S2 to them. In all, if n is the cardinality of the set S2 (equal to the cardinality of the set S1), we may have to try $n!/(n-8)!$ combinations of points. (In the case of n=8 we have 8! combinations.) The first seven points are then used to identify the line on which the eighth point should lie in the second frame, and the distance between this line and the real position of the projected point is used to evaluate the goodness (or in fact the badness) of fit. The identity assignment minimizing the distance may be considered the best. It is, however, easily seen that the task may be prohibitive. It is advisable to use additional information (e.g. substructures of visible connections between points) to bound the complexity.
Under these circumstances it is natural to ask to what extent the geometry of the optical system must be known in order to enable recovery of structure and motion from two frames. In [5] it has been demonstrated that knowledge of the exact position of the focal point relative to the projection plane imposes the following requirement on the balance of degrees of freedom and the amount of information:
$$3p - 1 + 6(k-1) \le 2pk$$
With p=5 points and k=2 frames we get $3 \cdot 5 - 1 + 6 = 20 \le 20 = 2 \cdot 5 \cdot 2$. Papers [10, 11] deal with recovery in that case.
Can we weaken the geometrical requirements? First let us consider the case where the relative position of the projection plane and the focal point is unknown, but fixed. Then each point of the body introduces 3 df in the first frame, minus one df for the whole body, as there is no possibility of determining the scaling of the whole body. Additionally, we have 3 df due to the uncertainty of the location of the focal point, while each subsequent frame adds 6 df of rigid motion. With p points forming the rigid body traced over k frames, we have $3p - 1 + 3 + 6(k-1) = 3p + 2 + 6(k-1)$ degrees of freedom. To achieve the balance we require:
$$3p + 2 + 6(k-1) \le 2pk$$
If we fix k at 2, then we require $3p + 8 \le 4p$, meaning $p \ge 8$. Hence we run into the same trouble with the seven-point limit.
Now, what if we know the relative position of the focal point and the projection plane up to the distance between them (that is, the focal point may only move towards and away from the projection plane along a fixed axis, a requirement fulfilled by typical modern cameras with unsupervised autofocusing)? Each point of the body introduces 3 df in the first frame, minus one df for the whole body, as there is no possibility of determining the scaling of the whole body. Additionally, we have 1 df due to the uncertainty of the location of the focal point, and each subsequent frame adds 7 df (six for rigid motion and one for the possibly changed focal distance). With p points forming the rigid body traced over k frames, we have $3p - 1 + 1 + 7(k-1) = 3p + 7(k-1)$ degrees of freedom. To achieve the balance we require:
$$3p + 7(k-1) \le 2pk$$
If we fix k at 2, then we require $3p + 7 \le 4p$, meaning $p \ge 7$. In this case we just meet the seven-point limit.
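The camera-knowledge scenarios discussed in this paper can be compared side by side. The Python sketch below encodes the two-frame degree-of-freedom counts; the seven df per further frame in the autofocus case is our reading of the text (six for rigid motion, one for the focal distance), chosen because it reproduces the stated seven-point result. Names are illustrative.

```python
# Degree-of-freedom counts for the camera-knowledge scenarios of
# Sections 3.1 and 4. Each entry maps (p, k) to the number of unknowns.
SCENARIOS = {
    "geometry fully known": lambda p, k: 3 * p - 1 + 6 * (k - 1),
    "known up to focal distance": lambda p, k: 3 * p - 1 + 1 + 7 * (k - 1),
    "unknown but fixed": lambda p, k: 3 * p - 1 + 3 + 6 * (k - 1),
    "unknown, varying per frame": lambda p, k: 3 * p - 1 + 3 + 9 * (k - 1),
}

SEVEN_POINT_LIMIT = 7  # two frames: points beyond the seventh add nothing new


def min_points(dof, k: int = 2) -> int:
    """Smallest p whose 2*p*k observed coordinates cover dof(p, k).
    Assumes the count is eventually met (true for every entry above)."""
    p = 1
    while 2 * p * k < dof(p, k):
        p += 1
    return p


for name, dof in SCENARIOS.items():
    p = min_points(dof)
    verdict = "within" if p <= SEVEN_POINT_LIMIT else "beyond"
    print(f"{name}: {p} points needed by counting, {verdict} the seven-point limit")
```

It reports 5, 7, 8 and 11 points respectively; only the first two stay within the seven-point limit established in Section 3.2, which matches the conclusions drawn above.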
5 Discussion
In this paper we have demonstrated that, for perspective projection of rigid bodies, in some situations at least three frames are necessary to recover structure and motion. A degrees-of-freedom argument suggested that two frames with eleven traced points might provide enough information to recover structure and motion. However, it has been demonstrated that this is impossible, because the rigid body assumption imposes an internal dependence between the point projections, so that the information provided by the eighth and any further traced point cannot be used for the recovery of structure and motion.
We need to stress that purely geometrical properties of “points” have been considered. In practical settings we generally have to handle errors in positioning points in the frame raster. If we assume that there exists a (possibly stochastic) dependence between measurement errors and the distance between (at least some) observed points and the camera, then we may have a clue how to recover the object-camera distance and may overcome the phenomenon demonstrated in this paper. But if the measurement error does not depend on the distance from the camera but on other factors, then there is no possibility of recovering the complete set of structure and motion parameters from two frames under perspective projection (from purely geometrical, point-dependent clues).
Instead, eight or more points over two frames may solve the problem of identifying points between consecutive frames or, alternatively, the problem of deciding which points belong to the same rigid body. In the first case, if we have two frames with 8 (or more) points each and we know that these points belong to the same rigid body, but we do not know the exact point-to-point correspondence, then we can exploit the unused information (not consumable for recovery of structure and motion) to identify the point-to-point correspondences. Alternatively, in the second case, when we have sets of points in two frames where the point-to-point correspondence between frames is known, we can exploit the unused information (not consumable for recovery of structure and motion) to decide which points belong to the same rigid body.
It is worth mentioning at this point that several papers have claimed that structure and motion can be recovered from two frames (using fewer than 7 points) [10, 11, 13, 12]. It must be stressed that in those papers complete knowledge of the geometry of the optical system is assumed. In that case, clearly, the number of degrees of freedom is different, and so are the statements about the necessary number of points and frames. In [5] we have demonstrated that the structure and motion recovery method proposed in [12] (two frames, four points and a line) is not correct due to unbalanced degrees of freedom. On the other hand, under such conditions, recovery for five points and two frames [10, 11] is possible.
Though recovery of structure and motion of rigid bodies consisting only of a (limited) number of traced points may seem a simplistic task, it is still of practical relevance. For example, we live in times of rapidly growing image databases, especially in criminology. Reviewing such databases by hand may prove prohibitive, and hence clues that significantly restrict the search space are important. It has been claimed that some simple measurements of the spatial structure of a few points on the surface of a face may be sufficient to identify a suspect. The important question is then: how many images (e.g. from a video camera of a bank security system) are needed, and how many features must be traced to recover the 3D structure of the points of interest? This study demonstrates the impact of knowledge of the geometrical structure of the optical system. Strict knowledge allows one to recover the structure from 2 images and 5 traced points [9]. If we know the geometry up to the distance between image plane and focal point, then we can still work with 2 images, but with 7 traceable points. And if we are totally ignorant of the geometry, at least 3 images are required, with 7 traced points.
6 Conclusions
- It is impossible to recover structure or motion from two frames, whatever number of traced points is available, if there is complete uncertainty about the relative position of the projection plane and the focal point from frame to frame. For recovery, at least 3 images are needed. The same is true even if this relative position does not change over time but is unknown.
- If a rigid body consists of at least eight points, then we can solve the problem of point tracing for any two consecutive frames solely from the knowledge of which points of the two frames belong to the body (without explicit knowledge of the point-to-point correspondence).
- Alternatively, if a rigid body consists of at least eight points, then we can solve the problem of membership in the rigid body for any two consecutive frames solely from explicit knowledge of the point-to-point correspondence.
- If we know the geometry of the optical system up to the distance between image plane and focal point (e.g. for a camera with autofocusing), then we can recover structure and motion from 2 images, but with 7 traceable points.
- Strict knowledge of the geometry of the optical system allows one to recover the structure from 2 images and 5 traced points, or from 3 images and 4 points [5].
References
- [1] Azarbayejani A., Pentland A.P.: Recursive estimation of motion, structure and focal length, IEEE Trans. PAMI, Vol. 17, pp. 562-575 (1995)
- [2] Kłopotek M.A.: Physical space in reconstruction of moving curves, [in:] Proc. National CIR’89 (Cybernetics, Intelligence, Development) Conference, Siedlce (Poland) 18-20.9.1989, Vol. I, pp. 55-71 (1989)
- [3] Kłopotek M.A.: 3-D-Shape reconstruction of moving curved objects, [in:] V. Miszalok (Ed.): Med Tech’89 Medical Imaging, Proc. SPIE 1357, pp. 29-39 (1990)
- [4] Kłopotek M.A.: A simple method of recovering 3D-curves from multiframes, Archiwum Informatyki Teoretycznej i Stosowanej, Vol. 4, No. 1-4, pp. 103-110 (1992)
- [5] Kłopotek M.A.: A comment on “Analysis of video image sequences using point and line correspondences”, Pattern Recognition, Vol. 28, No. 2, pp. 283-292 (1995)
- [6] Kłopotek M.A.: Distribution of Degrees of Freedom over Structure and Motion of Rigid Bodies, Machine Graphics & Vision, Vol. 4, No. 1/2, pp. 83-100 (1995)
- [7] Kłopotek M.A.: Reconstruction of 3-D rigid smooth curves moving free when two traceable points only are available, Machine Graphics and Vision, Vol. I, No. 1/2, pp. 392-405 (1992)
- [8] Lee C.H.: Interpreting image curve from multiframes, Artificial Intelligence, Vol. 35, pp. 145-164 (1988)
