Augmented Reality using Tango

What is Tango?

Tango is a technology platform developed and authored by Google that uses computer vision to enable mobile devices, such as smartphones and tablets, to detect their position relative to the world around them without using GPS or other external signals. This allows application developers to create user experiences that include indoor navigation, 3D mapping, physical space measurement, environmental recognition, augmented reality, and windows into a virtual world.

(Source: https://en.wikipedia.org/wiki/Tango_(platform))

Four devices are supporting the Tango technology at the moment we’ve written this article:

The Yellowstone tablet (Project Tango Tablet Development Kit) a 7-inch tablet with full Tango functionality, released in June 2014 (This is the device we are using in our showcase).
The Peanut was the first production Tango device, released in the first quarter of 2014.
Lenovo’s Phab 2 Pro is the first smartphone with the Tango technology, the device was announced at the beginning of 2016.
Asus ZenFone AR is the world’s first 5.7-inch smartphone with Tango and Daydream by Google.

The Idea behind working on a showcase using Tango was to learn more about this smart and powerful device.

The showcase is consisting of two requirements:

Place a virtual Modeso logo on a real-world surface (e.g. wall)
Make the virtual object — the Modeso logo in our case — not cover real world objects in front or rather occlude these

Result demo of first requirement in the Swiss office

‍

Second Requirement

From the first requirement we already know the depth on the logo in the 3d world. With the help of depth perception and the provided cloud points we can filter these points and get the subset of points with a depth value less than the depth of the logo putting in mind the quaternion of the logo.

Using the Tango update listener we can use the callback onXyzIjAvailable. This callback is invoked when new cloud point data gets available from Tango. Important to know is, that this callback is not running on the main thread.

Google issued a warning about these callbacks. Meaning you have to be very restrictive while working inside the callback, because you won’t receive new cloud data until you have returned from the callback.

So, for example if you are working on heavy stuff inside the callback it will affect the performance of your readings.

Every time we receive new cloud points from the Tango we filter these points depending on the the depth of the logo and update the Rajawali renderer with the new data.

Belal covered with the virtual Modeso logo and rendered cloud points

The virtual Modeso logo in front of an iPad, which was detected by Tango

The second step is to show a real object in front of the model to get the occlusion of real-world objects in front of the logo.

In order to make this possible the following three steps are required:

Get depth matrix of the model
Get intersection points
Add mask to intersection points

We used the Rajawali method calculateModelMatrix from the ATransformable3D class which takes the matrix of the current point cloud matrix. With the result points we could add a mask to it.

We were lucky to stumble upon https://github.com/stetro/project-Tango-poc which was a great reference and gave us the right idea on how to implement the masking.

Unmasked hand in front of the flipped virtual Modeso logo on the wall

Approaches

Here we will go a little bit deeper into the different approaches used and proposed for achieving the target goal.

Depth Mapping & 3D Reconstruction

This is the technique used by the repo at https://github.com/stetro/project-Tango-poc and it is based upon depth mapping. In short a depth map is a regular image (occasionally a grey scaled image) that contains information related to the distance of the surfaces of scene objects in the image from a specific viewpoint/camera. The color of each pixel of the depth image represents the distance (depth) of this pixel in real world from the camera e.g. in gray scale depth images dark areas represent closer points to the camera while lighter areas represent further points (or the reverse according to how you encoded the image)

How will depth maps help us achieving the goal?

If we have a depth map for our camera view i.e. if we know the depth of each pixel of the camera view we can perform what is called the “Depth Test” or “Z-Buffering” in Computer Graphics, in this algorithm, while rendering your 3D content, the hardware compares the value of each pixel of the rendered 3d content to the corresponding pixel in the depth map and decides whether it should draw this pixel or not according to whether it is occluded i.e, there is another object at the same pixel closer to the camera, or not. Basically that’s the idea.

How this is done in code?

The first step is that we need to compute the depth map of our view using Tango color camera to achieve that we have to:

1 Initialize the Tango service as documented using the C API

https://developers.google.com/tango/apis/c/

2 Configure the Tango device to use the color camera:

	ret = TangoConfig_setBool(Tango_config_, "config_enable_color_camera", true);
	if (ret != TANGO_SUCCESS) {
	LOGE(
	"config_enable_color_camera() failed with error"
	"code: %d",
	ret
	);
	...
	}

view raw useColorCamera.c hosted with ❤ by GitHub

3 Configure the Tango device to use depth camera:

	ret = TangoConfig_setBool(Tango_config_, "config_enable_depth", true);
	if (ret != TANGO_SUCCESS) {
	LOGE(
	"config_enable_color_camera() failed with error"
	"code: %d",
	ret
	);
	...
	}

view raw useDepthCamera.c hosted with ❤ by GitHub

4 Configure the Tango device to connect local callbacks for different feeds from the device:

	// Enable depth sensing
	ret = TangoService_connectOnXYZijAvailable(OnPointCloudAvailableRouter);
	if (ret != TANGO_SUCCESS) {
	LOGE(" could not enable depth sensing code: %d", ret);
	...
	}

	// Enable image sensing
	ret = TangoService_connectOnFrameAvailable(TANGO_CAMERA_COLOR, this,
	OnFrameAvailableRouter);
	if (ret != TANGO_SUCCESS) {
	LOGE(" could not enable image sensing code: %d", ret);
	...
	}

view raw connectLocalCallback.c hosted with ❤ by GitHub

5 In the “OnFrameAvailableRouter” callback, which is called when a new camera frame arrives and the frame image buffer is passed to it, we construct our depth image from the camera image:

	void OnFrameAvailableRouter(void context, TangoCameraId id, const TangoImageBuffer buffer) {
	if (buffer->format != TANGO_HAL_PIXEL_FORMAT_YCrCb_420_SP) {
	return;
	}
	...

	LOGD("getting color grame");
	int yuv_width_ = buffer->width;
	int yuv_height_ = buffer->height;
	int uv_buffer_offset_ = yuv_width_ * yuv_height_;
	int yuv_size_ = yuv_width_ * yuv_height_ + yuv_width_ * yuv_height_ / 2;
	std::vector <uint8_t> yuv_temp_buffer_;
	yuv_temp_buffer_.resize(yuv_size_);
	memcpy(&yuv_temp_buffer_[0], buffer->data, yuv_size_);

	cv::Mat rgb = cv::Mat(yuv_width_, yuv_height_, CV_8UC3);
	cv::Mat scaled_rgb = cv::Mat(640, 360, CV_8UC3);
	cv::Mat scaled_rgb_grayscale = cv::Mat(640, 360, CV_8UC1);

	for (size_t i = 0; i < yuv_height_; ++i) {
	for (size_t j = 0; j < yuv_width_; ++j) {
	size_t x_index = j;
	if (j % 2 != 0) {
	x_index = j - 1;
	}

	size_t rgb_index = (i * yuv_width_ + j) * 3;
	uint8_t red;
	uint8_t green;
	uint8_t blue;
	Yuv2Rgb(yuv_temp_buffer_[i * yuv_width_ + j],
	yuv_temp_buffer_[uv_buffer_offset_ + (i / 2) * yuv_width_ + x_index + 1],
	yuv_temp_buffer_[uv_buffer_offset_ + (i / 2) * yuv_width_ + x_index],
	&red, &green, &blue);

	rgb.at<cv::Vec3b>(j, i)[0] = red;
	rgb.at<cv::Vec3b>(j, i)[1] = green;
	rgb.at<cv::Vec3b>(j, i)[2] = blue;
	}
	}
	resize(rgb, scaled_rgb, scaled_rgb.size());
	cv::cvtColor(scaled_rgb, scaled_rgb_grayscale, CV_BGR2GRAY);

	cv::Mat scaled_gray = cv::Mat(640, 360, CV_8UC1);
	resize(depth, scaled_gray, scaled_gray.size());
	LOGD("converted YUV to RGB frame");

	// filtering ...
	LOGD("filtering ...");
	cv::ximgproc::guidedFilter(scaled_rgb_grayscale, scaled_gray, scaled_gray, 13, 0.05);

	// 320x180 pixel array with depth coordinates
	LOGD("create pointcloud ...");

	float fx = static_cast<float>(depth_camera_intrinsics_.fx) * 2;
	float fy = static_cast<float>(depth_camera_intrinsics_.fy) * 2;
	float cx = static_cast<float>(depth_camera_intrinsics_.cx) * 2;
	float cy = static_cast<float>(depth_camera_intrinsics_.cy) * 2;

	vertices.clear();

	for (int x = 0; x < 640; ++x) {
	for (int y = 0; y < 360; ++y) {
	int depth_value = scaled_gray.at<uint8_t>(x, y);


	float Z = ((float) depth_value * (float) kMaxDepthDistance) /
	((float) UCHAR_MAX * (float) kMeterToMillimeter);
	float Y = (y - cy) / fy * Z;
	float X = (x - cx) / fx * Z;


	glm::vec3 vector(X * 0.9, Y * 1.2, Z);

	vertices.push_back(vector.x);
	vertices.push_back(vector.y);
	vertices.push_back(vector.z);
	}
	}

	LOGD("DONE");
	}

view raw processDepthImage.c hosted with ❤ by GitHub

But what are we doing inside this callback exactly?

First we are converting the received image buffer (TangoImageBuffer) format from YUV color space (default) to RGB color format. This is done using the YUV2RGB function for each pixel.

Next we construct an OpenCV Matrix (image) over the resulting RGB image from the first step and create a gray scaled version of it.

After that we apply a GuidedFilter to the OpenCV image. Guided filter is an edge-preserving smoothing filter. See here for more. In effect this filtering smoothes the occluded parts of the 3d object so it looks more real. The resulting filtered gray scaled image is used as our Depth map.

Based on this depth map we will try to construct a 3D representation of the camera view. We now have the z coordinate of each pixel in the camera view and need to know the other two coordinates (X, Y). Here Tango supports us providing an equation to translate 2d coordinates to 3d ones and vice versa using the camera intrinsics.

Given a 3D point (X, Y, Z) in camera coordinates, the corresponding pixel coordinates (x, y) are:
x = X / Z * fx * rd / ru + cx
y = Y / Z * fy * rd / ru + cy

After solving the previous equation for X and Y we now have the three components X,Y and Z. So now we can construct a 3D points version of the camera’s real view. For that we construct a vector of vertices and fill it with the data as last required step.

With that all in place we can perform the “Depth Test” in OpenGl as we have the 3d model to render and a 3d representation of the camera view. To render parts closer to camera and occlude the further ones we only need to check them against each other:

	void Scene::Render(....) {
	glEnable(GL_DEPTH_TEST);
	glClearColor(1.0f, 1.0f, 1.0f, 1.0f);
	glClear(GL_DEPTH_BUFFER_BIT \| GL_COLOR_BUFFER_BIT);
	...

	// draw your 3d content
	//Draw the 3d points representation of the camera view (vertices vector)
	glDisable(GL_DEPTH_TEST);
	// Render regular camera view out
	...

	}

view raw Render.c hosted with ❤ by GitHub

This is what you will get:

Gray scaled depth image (darker means closer)

Conclusion

The results were not as accurate as expected to proceed with the targeted scenario.

We have noticed that the Tango is heating up very heavy while using it for more than five minutes. Google is mentioning this problem already. Area learning is using heavy processing power causing the processor to heat up and to protect the processor the device reduces the processor speed which has a negative effect on the readings and the data produced by the Tango sensors in general.

Furthermore, the object detection is very imprecise, the camera is not yet good enough to produce good results from image processing. Even with a better camera the image processing would be very expensive in processing power and would impact overall performance.

On the other hand it’s very good with augmented reality in general and placing objects on walls or floors without the need for markers. In direct comparison with vuforia and it’s markers it’s the winner without questioning.

But for real world real time object occlusion it seems to still miss needed functionality.

Credits: Modeso’s Mobile Engineers Belal Mohamed, Mahmoud Abd El Fattah Galal & Mahmoud Galal.

‍