Validation of an IMU-camera fusion algorithm using an industrial robot

The integration of a down-looking camera with an inertial measurement unit (IMU) sensor makes it possible to provide a lightweight and low-cost pose estimation system for unmanned aerial vehicles (UAVs) and micro-UAVs (MAVs). Recently, the authors developed a fusion filter for IMU and exteroceptive sensor data for position and orientation estimation. The estimate is intended for use in the outer control loop of a UAV for position control. This work presents an experimental setup to test that algorithm using an industrial robot to produce accurate planar trajectories, as a safe alternative to testing the algorithm on real UAVs. The results of the IMU-camera fusion estimation for linear positions and linear velocities show errors small enough for the algorithm to be integrated on real UAVs.


I. Introduction
The autonomous navigation of unmanned aerial vehicles (UAVs) in GPS-denied environments is currently possible by using controllers that employ on-board computation of the feedback signals of the vehicle pose (position and orientation) from multiple sensors (e.g., IMUs (inertial measurement units), cameras (monocular, stereo, and RGB-D), and/or laser range finders), combined with efficient data-fusion algorithms that together minimize the uncertainty of the individual sensor measurements.
Concerning state estimation for UAVs and MAVs, Kumar and Michael (2012) highlighted that the integration of camera sensors with IMUs enables a lightweight alternative to designs based on laser range finders or RGB-D cameras (which provide dense and rich three-dimensional information, but at a cost in payload and computational resources). With the aims of (i) using low-weight and low-cost sensors, (ii) using efficient real-time algorithms to reduce the computational resources required, (iii) using on-board computation to avoid the latency and possible interruptions of the communication link needed for off-board computation, and (iv) reducing the energy consumption to increase the autonomy, the authors developed a sensor fusion algorithm (Perez-Paina et al., 2017) for IMU and camera sensors that is further experimentally evaluated in this work.
In order to test a sensor fusion algorithm intensively, the first option is to use a real UAV, but this involves a risk of damage if the algorithm fails during flight. The main idea of the present work is to test the algorithm on-line but detached from real UAV flights. It is thus necessary to generate an accurate reference motion using another facility; in this case, an industrial manipulator is chosen. The proposed experimental setup generates motions for an IMU and a down-looking camera mounted on the end-effector of a robot manipulator to emulate the signals produced by a UAV in motion. The measurements of the IMU and the down-looking camera are fused using the algorithm proposed by Perez-Paina et al. (2017). The aim of this work is to validate the sensor fusion for a motion parallel to the ground plane, including linear position and linear velocity, to be used in a UAV on-board controller such as the one proposed by Pucheta et al. (2016). The full pose estimation will be tackled in future work. The experiments show that the errors in the measurements are small enough for the estimates to be integrated in the real system.
The article is organized as follows. Section II describes the robot manipulator used to generate the reference motion, the selection of the fiducial visual marker used to estimate the pose of the down-looking camera, and the rigid transformations used for the IMU-camera sensors. Section III briefly reviews the fusion estimation algorithm. Section IV describes the implementation and the results obtained for two different paths, and Section V presents the conclusions and future work.

II. Experimental setup

A. Welding manipulator
An industrial robot manipulator, a CLOOS Romat 310 available at the Mechanical Engineering Department of the FRC-UTN, is used in this work to generate accurate trajectories. This robot is a six-degrees-of-freedom serial anthropomorphic arm equipped with the ROTROL control system and a proprietary programming language named CAROLA. Compared with the average size of welding robots available on the market, its workspace is large (such robots are often combined with an additional translational axis to weld large mechanical parts), and it is large enough to contain reference trajectories for UAV calibration purposes.
The robot can be programmed by teaching way-points to bring the end-effector near the working volume; from there, several trajectories can be defined using geometric templates referenced to a predefined coordinate system. For instance, a program can be referenced to the home position, to a user-defined frame of reference, or to a frame located at the base of the robot; in this work the latter frame is chosen. The end-effector pose is defined by a 6-tuple of coordinates: three for the Cartesian position (x, y, z) with respect to a frame located at the base of the robot, and three for the orientation angles relative to the home frame orientation. Because of its welding capabilities, the programming language allows the user to define many characteristics of the motion interpolation, including several controlled ways to approach sharp corners of the trajectories that avoid overshooting or oscillation along the target trajectory. This makes the robot a valuable research platform for generating accurate free-form paths in SE(3).
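Such a 6-tuple pose can be viewed as a homogeneous transform in SE(3). The sketch below (Python/numpy, assuming a ZYX Euler-angle convention, which may differ from the actual CAROLA convention) builds the corresponding 4×4 matrix:

```python
import numpy as np

def pose_to_transform(x, y, z, a, b, c):
    """Build a homogeneous SE(3) transform from a 6-tuple pose:
    Cartesian position (x, y, z) and ZYX Euler angles (a, b, c)
    in radians (an assumed convention for illustration)."""
    ca, sa = np.cos(a), np.sin(a)
    cb, sb = np.cos(b), np.sin(b)
    cc, sc = np.cos(c), np.sin(c)
    Rz = np.array([[ca, -sa, 0.0], [sa, ca, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cb, 0.0, sb], [0.0, 1.0, 0.0], [-sb, 0.0, cb]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cc, -sc], [0.0, sc, cc]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx   # rotation part
    T[:3, 3] = [x, y, z]       # translation part
    return T
```

Chaining such transforms reproduces the composition of taught way-points with template-defined relative motions.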

B. Visual marker for pose estimation
A fiducial visual marker on the ground is used to obtain an estimate of the absolute pose of the down-looking camera; detecting and tracking such a marker is easier than employing natural features. Square markers are preferable to other shapes because they provide four salient points (corners) that can be used to determine the pose with respect to the camera, while the inner region can be used for identification. Different identification methods exist, binary patterns being the most common. Among the predefined visual-marker patterns available in the literature, the ArUco dictionary of markers (Garrido-Jurado et al., 2014), which maximizes the inter-marker distance, is chosen. This allows reducing the dictionary to the minimum size needed, increasing the detection speed when more than one marker is used.
The ArUco algorithm is integrated with the OpenCV library (Bradski & Kaehler, 2008) to compute the rigid transformations between the camera and the marker frame systems.

C. Reference frame systems for the used setup
The reference systems of the setup can be seen in Fig. 2, where {i} is the inertial or navigation frame attached to the ground visual marker, {b} is the body frame (coincident with the IMU frame), and {c} is the camera frame.
The translation vector t^c_ci ∈ ℝ³ and the orientation matrix R_ci ∈ SO(3), relating the navigation and the camera frames, are estimated by the visual algorithm. The translation vector t^i_ib ∈ ℝ³ and the orientation matrix R_ib ∈ SO(3), relating the navigation and the body frames, are estimated by the fusion filter. Last, the relative pose given by the vector t^b_bc ∈ ℝ³ and the matrix R_bc ∈ SO(3), relating the camera and body frames, is a known datum measured by hand. As can be seen in Fig. 2, this relative pose is given only by a translation in the z coordinate and therefore R_bc = I.
Figure 2: Three frame systems and their rigid transformations represented by curved arrows: {c} is the frame system located at the optical center of the camera, {i} represents the inertial or navigation frame system, and {b} is the frame with unknown pose to be referred with respect to {i}. For each frame, local arrows in red, green, and blue color represent the x, y, and z axes, respectively.

D. Reference frame transformation
Both the relative position and orientation between the camera reference system {c} and the reference system of the ground marker attached to the inertial frame {i} can be written as t^c_ci ∈ ℝ³ and R_ci ∈ SO(3), respectively (Ma et al., 2010). For the vector t^c_ci, the sub-index ci indicates the starting, {c}, and target, {i}, reference systems, and the supra-index indicates the frame in which the coordinates of this vector are expressed. Concerning the convention for the orientation matrices, R_ci changes the coordinates of a vector from frame {i} to frame {c}.
As was stated, using the visual algorithm the pose given by t^c_ci and R_ci is obtained, from which the inverted transformation needed by the fusion filter, t^i_ic and R_ic, is computed. In order to compute the elements of the unknown pose, t^i_ib and R_ib, a geometric analysis is performed. Vector addition of relative positions can be formulated as

t^i_ib = t^i_ic + t^i_cb,    (1)

where t^i_ic and t^i_cb respectively represent the same vectors as t^c_ci (although in the opposite sense) and t^b_bc, but referred to the coordinate system {i} to make the sum consistent. This pair of positions can be obtained by using the change-of-coordinates matrix R_ic, which is the inverse (and also the transpose, for being orthogonal) of R_ci computed by the detection algorithm, i.e.,

t^i_ic = −R_ic t^c_ci,    (2)
t^i_cb = −R_ic t^c_bc.    (3)

Then, (1) can be written as

t^i_ib = −R_ic t^c_ci − R_ic t^c_bc.    (4)

Provided that R_ic = (R_ci)^T, (1) can also be written as

t^i_ib = −(R_ci)^T (t^c_ci + t^c_bc).    (5)

On the other hand, Fig. 2 shows that

v^i = R_ib v^b,    (6)

where, clearly, the matrix R_ib that represents the orientation of the frame system {b} with respect to the frame system {i} is the same one that produces the change of basis for a vector from {b} to {i}. Therefore, the orientation of frame {b} with respect to frame {i} can be written as

R_ib = R_ic R_cb,    (7)

or, using the data supplied by the algorithm described by Garrido-Jurado et al. (2014), as

R_ib = (R_ci)^T R_cb.    (8)
Figure 3 shows the real setup, similar to the schematic view in Fig. 2, indicating the three reference systems involved: {i}, {c}, and {b}. The 3D-printed sensor holder, which rigidly attaches the camera and the IMU (fixed to the manipulator end-effector), can be seen in the zoomed area. Figure 3: Yellow and orange arrows represent relative position and orientation between different systems. Red, green, and blue arrows represent the x, y, and z axes, respectively.
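The frame composition above can be sketched numerically as follows (a minimal Python/numpy illustration; the values of `R_ci`, `t_ci`, and `t_bc` are made up for the example, not real calibration data):

```python
import numpy as np

# Hypothetical measurement from the visual algorithm: pose of the
# inertial/marker frame {i} with respect to the camera frame {c}.
R_ci = np.array([[0.0, 1.0,  0.0],
                 [1.0, 0.0,  0.0],
                 [0.0, 0.0, -1.0]])   # change of coordinates {i} -> {c}
t_ci = np.array([0.2, -0.1, 1.5])     # from {c} to {i}, in {c} coordinates

# Known camera-body offset measured by hand: translation along z only,
# with R_bc = I (illustrative value).
t_bc = np.array([0.0, 0.0, 0.05])     # from {b} to {c}

# Invert the camera measurement: R_ic is the transpose of R_ci.
R_ic = R_ci.T

# Position of {b} with respect to {i}, expressed in {i}.
t_ib = -R_ic @ (t_ci + t_bc)

# Orientation of {b} with respect to {i} (R_cb = I here).
R_ib = R_ci.T
```

The resulting `t_ib` and `R_ib` are the quantities the fusion filter compares against.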

III. IMU-Camera fusion
The implemented sensor fusion algorithm is based on the error-state Kalman filter (ESKF), which performs a loosely coupled sensor fusion (Madyastha et al., 2011; Solà, 2015). The inertial unit is composed of a three-axis accelerometer and a three-axis gyroscope. The implemented filter is based on the one described by Perez-Paina et al. (2017).

A. Error-state Kalman Filter
In the ESKF, the state is expressed as a nominal value plus an error or perturbation, x = x_n ⊕ δx, where the symbol ⊕ (composition) is an ordinary addition for position and velocity, and a quaternion product for the orientation. The ESKF performs the nominal state propagation while estimating the error state used to correct this integration.
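This composition can be sketched as follows (a minimal Python/numpy illustration; the function names and the right-multiplication of the small-angle error quaternion are assumptions for the sketch, not the authors' implementation):

```python
import numpy as np

def quat_mul(q, r):
    """Hamilton product of two quaternions, scalar-first (w, x, y, z)."""
    w0, x0, y0, z0 = q
    w1, x1, y1, z1 = r
    return np.array([w0*w1 - x0*x1 - y0*y1 - z0*z1,
                     w0*x1 + x0*w1 + y0*z1 - z0*y1,
                     w0*y1 - x0*z1 + y0*w1 + z0*x1,
                     w0*z1 + x0*y1 - y0*x1 + z0*w1])

def compose(nominal, delta_p, delta_v, delta_theta):
    """x = x_n (+) dx: ordinary addition for position and velocity,
    quaternion product for the orientation, using the small-angle
    error quaternion dq ~ [1, dtheta/2] (normalized)."""
    p, v, q = nominal
    dq = np.concatenate(([1.0], 0.5 * delta_theta))
    dq /= np.linalg.norm(dq)
    return p + delta_p, v + delta_v, quat_mul(q, dq)
```

With a zero error state, the composition leaves the nominal state unchanged, which is the consistency property the filter relies on after each reset.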
The formulation of the ESKF is based on the system stochastic model, given by

ẋ = f(x, u, w),    (9)
z = h(x) + n,      (10)

where the state x has to be expressed in terms of the nominal and the error parts. The nominal state x = [p v q]^T is composed of the position p, the linear velocity v, and the orientation expressed as a unit quaternion q, all with respect to an inertial frame. The inertial measurements are used as input in the process model (9), with u = [a_m ω_m]^T. On the other hand, the exteroceptive sensor measurement of (10), given here by a visual monocular pose estimation, is composed of the absolute position and orientation, i.e. z = [p_c q_c]^T. The prediction stage of the ESKF filter is given by

x̂⁻_k = f_d(x̂_{k−1}, u_k),          (11)
P⁻_k = F_k P_{k−1} F_k^T + Q_k,     (12)

where (11) is obtained from the discretization of (9) and is used to integrate the nominal state from the inertial measurements. Equation (12) propagates the covariance matrix P representing the estimation uncertainty and incorporates the sensor noise of (9), w ∼ N(0, Q), which also has to be obtained from the continuous-time model. On the other hand, the update stage of the ESKF is given by

K_k = P⁻_k H_k^T (H_k P⁻_k H_k^T + R)⁻¹,   (13)
δx̂_k = K_k (z_k ⊖ ẑ⁻_k),                   (14)
P_k = (I − K_k H_k) P⁻_k,                   (15)

which uses the camera pose measurement in (14) and the sensor model given by (10). Equation (13) incorporates the measurement noise of (10), n ∼ N(0, R).
As usual, the filter orientation error is denoted in its minimal representation, avoiding redundant parameters that can produce singularities when applying the needed additional constraints. In the context of the Kalman filter this is also known as a multiplicative error model (Markley, 2004; Markley, 2003). Thus, the filter estimation error is

δx = x ⊖ x̂ = [δp δv δq]^T,    (16)

where, for a small orientation error, δq can be expressed as

δq ≈ [1 ½δθ^T]^T,    (17)

with δθ being the minimal orientation-error representation. Then, the error is used to compute the covariance matrix P = E[δx δx^T], which describes the uncertainty in the filter estimation required in (12), (13), and (15).

B. Process model based on inertial sensor measurements
The kinematic model used for integrating the inertial measurements is applied in the prediction stage of the estimation filter. For the Kalman filter implementation, the discrete-time kinematic model is derived from the continuous-time model, which for the case of the error-state Kalman filter has to be separated into: (1) a model for the nominal state integration, and (2) a model for the uncertainty propagation.
1) Continuous-time model: The continuous-time kinematic model is

ṗ = v,              (18)
v̇ = R(q) a + g,     (19)
q̇ = ½ q ⊗ ω,        (20)

where R(q) is the matrix representation of the orientation given by the unit quaternion q, g is the gravity acceleration expressed in the inertial frame, ω is treated as a pure quaternion, and the symbol ⊗ represents the Hamilton product (Trawny and Roumeliotis, 2005), where the first quaternion component is the scalar value.
The accelerometer and gyroscope inertial sensor models are

a_m = a + a_n,    (21)
ω_m = ω + ω_n,    (22)

where the subscript m stands for the measured value, which is composed of the true value plus an additive Gaussian noise, a_n ∼ N(0, σ_a²) and ω_n ∼ N(0, σ_ω²). By replacing (21) in (19) and (22) in (20), the continuous-time stochastic model of (9) is obtained, which gives

ṗ = v,                        (23)
v̇ = R(q)(a_m − a_n) + g,      (24)
q̇ = ½ q ⊗ (ω_m − ω_n).        (25)

Then, the nominal state evolution is given by

ṗ_n = v_n,                (26)
v̇_n = R(q_n) a_m + g,     (27)
q̇_n = ½ q_n ⊗ ω_m,        (28)

and the error-state evolution by

δṗ = δv,                                  (29)
δv̇ = −R(q_n)[a_m]_× δθ − R(q_n) a_n,      (30)
δθ̇ = −[ω_m]_× δθ − ω_n.                   (31)

The derivation of (26) to (31) follows Solà (2015) and Trawny & Roumeliotis (2005). As can be seen, the error-state model of (29) to (31) is linear, and can be expressed as

δẋ = A δx + W w,    (32)

with A ≡ A(x_n, u) given, using [·]_× to denote the skew-symmetric cross-product matrix, by

A = [0 I 0; 0 0 −R(q_n)[a_m]_×; 0 0 −[ω_m]_×].    (33)
2) Discrete-time model: In order to obtain the discrete-time model, it is proposed here to use the Euler integral solution of (26)-(31) for the nominal state and uncertainty propagation, respectively. The discrete-time evolution of the nominal state (omitting the subscript n) is then given by

p_k = p_{k−1} + v_{k−1} Δt,                  (34)
v_k = v_{k−1} + (R(q_{k−1}) a_m + g) Δt,     (35)
q_k = (I + ½ Ω(ω_m) Δt) q_{k−1},             (36)

where Δt = t_k − t_{k−1} and Ω(ω_m) is a 4×4 skew-symmetric matrix. Therefore, the state transition matrix of (12) results in

F_k = I + A Δt,    (37)

where A is given by (33). The discrete-time covariance matrix of the process noise is given by (Simon, 2006)

Q_k = ∫₀^Δt F(τ) W Q_c W^T F(τ)^T dτ.    (38)

A detailed representation of the sub-matrices of (38) can be found in Perez-Paina et al. (2017).
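One Euler step of the nominal-state integration can be sketched as follows (Python/numpy; the function names and the gravity value are illustrative assumptions, and the quaternion step is written with the Hamilton product rather than the equivalent 4×4 matrix form):

```python
import numpy as np

G = np.array([0.0, 0.0, -9.81])  # gravity in the inertial frame (assumed value)

def quat_mul(q, r):
    """Hamilton product, scalar-first (w, x, y, z)."""
    w0, x0, y0, z0 = q
    w1, x1, y1, z1 = r
    return np.array([w0*w1 - x0*x1 - y0*y1 - z0*z1,
                     w0*x1 + x0*w1 + y0*z1 - z0*y1,
                     w0*y1 - x0*z1 + y0*w1 + z0*x1,
                     w0*z1 + x0*y1 - y0*x1 + z0*w1])

def quat_to_rot(q):
    """Rotation matrix R(q) of a unit quaternion."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])

def predict_nominal(p, v, q, a_m, w_m, dt):
    """One Euler step: p <- p + v dt, v <- v + (R(q) a_m + g) dt,
    q <- q + 1/2 q (x) [0, w_m] dt, followed by renormalization."""
    R = quat_to_rot(q)
    p_new = p + v * dt
    v_new = v + (R @ a_m + G) * dt
    q_new = q + 0.5 * quat_mul(q, np.concatenate(([0.0], w_m))) * dt
    return p_new, v_new, q_new / np.linalg.norm(q_new)
```

In level hover, the accelerometer measures the gravity reaction, so the velocity update cancels and the nominal state stays constant.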

C. Observation model based on camera measurements
The measurement vector of the visual pose estimation algorithm is composed of the position and orientation with respect to the inertial frame, i.e. z = [p_c q_c]^T. Hence, the measurement function is z = h(x) + n, where n ∼ N(0, R) is the additive Gaussian noise. In order to compute the filter innovation, it is necessary to express the orientation using a unit quaternion, such that δz = z ⊖ ẑ⁻ can be computed as an ordinary subtraction for the position and using the quaternion product for the orientation. Given the nominal predicted state, the measurement prediction is ẑ⁻ = h(x̂⁻), and the innovation can be computed as

δp_z = p_c − p̂⁻,          (39)
δq_z = (q̂⁻)* ⊗ q_c,       (40)

and, considering a small orientation error,

δq_z ≈ [1 ½δθ_z^T]^T,    (41)

then the minimal representation of the innovation is δz = [δp_z δθ_z]^T. On the other hand, the Jacobian matrix H of the measurement model, used in (13) and (15), has to be defined. Given that the visual algorithm provides absolute position and orientation measurements, this matrix results in

H = [I 0 0; 0 0 I].    (42)

The covariance of the camera measurements, used in (13), is R = diag(R_p, R_θ), which is composed of the covariance matrix R_p = σ_p² I describing the uncertainty in the position and R_θ = σ_θ² I for the orientation expressed with Euler angles. For a more complex sensor, the measurement Jacobian matrix can be obtained as explained in Solà (2015).
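The innovation computation can be sketched as follows (Python/numpy; helper names are hypothetical, and the small-angle extraction δθ ≈ 2 × vector part follows the approximation described above):

```python
import numpy as np

def quat_mul(q, r):
    """Hamilton product, scalar-first (w, x, y, z)."""
    w0, x0, y0, z0 = q
    w1, x1, y1, z1 = r
    return np.array([w0*w1 - x0*x1 - y0*y1 - z0*z1,
                     w0*x1 + x0*w1 + y0*z1 - z0*y1,
                     w0*y1 - x0*z1 + y0*w1 + z0*x1,
                     w0*z1 + x0*y1 - y0*x1 + z0*w1])

def quat_conj(q):
    """Conjugate of a quaternion (inverse for unit quaternions)."""
    return np.array([q[0], -q[1], -q[2], -q[3]])

def innovation(z_p, z_q, p_pred, q_pred):
    """Minimal innovation [dp, dtheta]: ordinary subtraction for the
    position, quaternion 'difference' for the orientation, then a
    small-angle extraction of the orientation error."""
    dp = z_p - p_pred
    dq = quat_mul(quat_conj(q_pred), z_q)
    if dq[0] < 0:                 # keep the shortest rotation
        dq = -dq
    dtheta = 2.0 * dq[1:4]        # small-angle approximation
    return np.concatenate([dp, dtheta])
```

When the camera measurement coincides with the prediction, the innovation vanishes, as expected.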

IV. Experimental results

A. Implementation
The experimental results were obtained using the following hardware: an inertial measurement unit (Microstrain 3DM-GX1), from which raw data were used, and a camera (The Imaging Source DMM22BUC03, USB 2.0, 640×480 px) with a low-distortion board lens (The Imaging Source TBN3.5C-3MP, M12×0.5 mm mount, 3.5 mm focal length). The fusion filter was implemented in the C++ programming language using the computer vision library OpenCV version 3.3.0, with the contribution module (also v3.3.0) for the ArUco algorithm implementation, and the Eigen library v3.3 for the filter matrix operations.

B. Results
The experimental evaluation of the fusion filter was carried out using two different paths at constant linear velocity: a circle and a square. Figure 4 shows the estimation result for the circular path. In Fig. 4a the path can be observed: the red line represents the real path generated by the robot manipulator, and the blue one is the estimate given by the fusion filter, whereas Fig. 4b shows the position in meters and the linear velocity in meters per second against time in seconds. Figure 5 shows the same as Fig. 4 but for the square path.
As can be appreciated in the top views of the x-y paths given in Figs. 4a and 5a, the estimate presents more noise near the x = 0 and y = 0 planes. When approaching these particular points, the visual algorithm gives an orientation estimate that oscillates between both sides of the x or y axes. Then, when the camera measurement is inverted to obtain the pose with respect to the navigation frame, the orientation oscillation amplifies the error in the translation estimate. The elimination of these effects will be the focus of future research.

V. Conclusions
This work presented preliminary results of experiments carried out to test a fusion algorithm that integrates the measurements of a down-looking camera with an inertial measurement unit, which will form part of the full-pose feedback signal of a UAV control system. An industrial robot was used to generate planar trajectories at constant linear velocity, a circle and a square, to test the algorithm for position estimation. The results for each task show position and velocity estimation errors small enough for integration on real UAVs. Future experiments will include the orientation estimation to test the full pose estimation. It is also expected that the level of noise on a real UAV will be higher than that produced by the robot.