Box: Bringing image recognition and OCR to cloud content management
Editor’s note: In this guest editorial by Box’s Senior Director of Product Management, Ben Kus tells us how they used Google Cloud Vision to add a new level of image recognition to Box.
Images are the second most common and fastest growing type of file stored in Box. Trust us: that’s a lot of images.
Ranging from marketing assets to product photos to completed forms captured on a mobile device, these images are relevant to business processes and contain a ton of critical information. And yet, despite the wealth of value in these files, the methods that organizations use to identify, classify and tag images are still mostly manual.
Personal services like Google Photos, on the other hand, have gone far beyond simply storing images. These services intelligently organize photos, making them easier to discover. They also automatically recognize images, producing a list of relevant photos when users search for specific keywords. As we looked at this technology, we thought, "Why can't we bring it to the enterprise?"
The idea was simple: find a way to help our customers get more value from the images they store in Box. We wanted to make image files as easy to find and search through as text documents. We needed the technology to provide high-quality image labeling, be cost-effective and scale to the massive number of image files stored in Box. We also needed it to handle thousands of image uploads per second, and we had to ensure that users actually found the image recognition useful. But we didn't want to build a team of machine learning experts to develop yet another image analysis technology. That just wasn't the best use of our resources.
That's where Google Cloud Vision came in. The image analysis results were high-quality, the pay-as-you-go pricing model enabled us to get something to market quickly without an upfront cost (aside from engineering resources), and we trusted that the service backed by Google expertise could seamlessly scale to support our needs. And, since many of the image files in Box contain text—such as licenses, forms and contracts—Cloud Vision’s optical character recognition (OCR) was a huge bonus. It could even recognize handwriting!
Using Google Cloud Vision was straightforward. The API accepts an image file, analyzes the image's content and extracts any printed words, and then returns labels and recognized characters in a JSON response. Cloud Vision classifies the image into categories based on similar images, runs the types of analysis specified in the developer's request, and returns the results along with a confidence score for its analysis.
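To make the request shape concrete, here is a minimal sketch of the JSON body for Cloud Vision's REST `images:annotate` method, asking for both labels and OCR text. It uses only the JDK so it can run without dependencies. The class name, method name and the `maxLabels` parameter are illustrative and not taken from Box's code, which used the Google API Client Library for Java.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Sketch: builds the JSON body for a Cloud Vision images:annotate REST call. */
final class VisionRequestSketch {

    // Public REST endpoint for batch image annotation.
    static final String ENDPOINT = "https://vision.googleapis.com/v1/images:annotate";

    /** Returns a request body asking for labels and OCR text for one image. */
    static String buildAnnotateBody(byte[] imageBytes, int maxLabels) {
        // The REST API expects the raw image bytes as a base64 string.
        String content = Base64.getEncoder().encodeToString(imageBytes);
        return "{\"requests\":[{"
            + "\"image\":{\"content\":\"" + content + "\"},"
            + "\"features\":["
            + "{\"type\":\"LABEL_DETECTION\",\"maxResults\":" + maxLabels + "},"
            + "{\"type\":\"TEXT_DETECTION\"}"
            + "]}]}";
    }

    public static void main(String[] args) {
        // Placeholder bytes; a real caller would read an image file instead.
        byte[] placeholder = "not a real image".getBytes(StandardCharsets.UTF_8);
        System.out.println("POST " + ENDPOINT);
        System.out.println(buildAnnotateBody(placeholder, 5));
    }
}
```

The response to such a request carries label annotations (each with a description and a score) and text annotations for any recognized characters.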
To securely communicate with Google Cloud Vision, we used the Google API Client Library for Java to establish an HTTPS connection via our proxy server. The simplest way to do this is to modify the JVM's proxy settings (i.e., https.proxyHost and https.proxyPort) and use Java's Authenticator class to provide credentials to the proxy. The downside of this approach is that it affects all of your outgoing connections, which may be undesirable (e.g., if you want other connections to bypass the proxy). For this reason, we chose to use the ApacheHttpTransport class instead. It can be configured to use a proxy server only for the connections that it creates. For more information, see this post.
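The difference between the two approaches can be sketched as follows. To stay dependency-free, this example swaps in the JDK's own `java.net.http.HttpClient` for the scoped case rather than `ApacheHttpTransport`, so it illustrates the idea of a per-client proxy and is not the configuration Box used. The class name, method names and the proxy host and port are hypothetical.

```java
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.http.HttpClient;

/** Sketch: JVM-wide proxy settings versus a proxy scoped to one HTTP client. */
final class ScopedProxySketch {

    /** JVM-wide approach: HTTPS connections across the whole process pick up the proxy. */
    static void useProxyForWholeJvm(String host, int port) {
        System.setProperty("https.proxyHost", host);
        System.setProperty("https.proxyPort", Integer.toString(port));
    }

    /** Scoped approach: only requests sent through the returned client use the proxy. */
    static HttpClient clientWithProxy(String host, int port) {
        InetSocketAddress address = new InetSocketAddress(host, port);
        return HttpClient.newBuilder()
            .proxy(ProxySelector.of(address))
            .build();
    }

    public static void main(String[] args) {
        // Hypothetical proxy address, for illustration only.
        HttpClient client = clientWithProxy("proxy.example.com", 3128);
        System.out.println("Proxy configured on this client: " + client.proxy().isPresent());
    }
}
```

Other clients created in the same JVM are unaffected by `clientWithProxy`, which is the property that motivated the choice described above.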
To access Google Cloud Vision, you need credentials: either an API key or a service account. Regardless of which credentials you use, you'll want to keep them secret, so that no one else can use your account (and your money!). For example, do not store your credentials directly in your code or your source tree, do control access to them, do encrypt them at rest, and do rotate them periodically.
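One common way to keep a key out of the source tree is to read it from the process environment at startup. The sketch below assumes an API key supplied in an environment variable; the variable name `VISION_API_KEY` and the helper names are hypothetical, and a production setup would more likely pull the secret from a dedicated secrets manager.

```java
import java.util.Map;

/** Sketch: load an API key from the environment instead of hard-coding it. */
final class ApiKeySketch {

    // Hypothetical variable name; choose whatever fits your deployment.
    static final String ENV_VAR = "VISION_API_KEY";

    /** Returns the trimmed key, or fails fast if it is missing or blank. */
    static String loadApiKey(Map<String, String> env) {
        String key = env.get(ENV_VAR);
        if (key == null || key.trim().isEmpty()) {
            throw new IllegalStateException(ENV_VAR + " is not set");
        }
        return key.trim();
    }

    /** Masks a key for logging so the full secret never reaches log files. */
    static String mask(String key) {
        if (key.length() <= 4) {
            return "****";
        }
        return "****" + key.substring(key.length() - 4);
    }

    public static void main(String[] args) {
        try {
            String key = loadApiKey(System.getenv());
            System.out.println("Loaded key " + mask(key));
        } catch (IllegalStateException e) {
            System.out.println("Cannot start: " + e.getMessage());
        }
    }
}
```

Passing the environment in as a map keeps the lookup easy to test without touching real credentials.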
To bring these capabilities to Box, we needed a set of images to send to the API and a destination for the results it returns. Now, when an image is uploaded to a folder in Box with the feature enabled (either via the web application or the API), the image is automatically labeled, and any text is automatically recognized and stored as metadata on the file. These metadata and representation values are then indexed for search, which means users can use our web application, a partner integration or even a custom application built on the Box Platform to search for keywords that might be found in their image content. The search results appear almost instantly, based on Google Cloud Vision's analysis. Developers can also request the metadata on the image file via the Box API to use elsewhere in an application.
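As a rough illustration of the labeling step, the sketch below turns label descriptions and confidence scores (the two fields Cloud Vision returns for each label annotation) into a list of searchable tags. The article does not describe how Box filters or stores labels, so the threshold, the lowercasing and all names here are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Sketch: keep only labels whose confidence score clears a threshold. */
final class LabelTaggingSketch {

    /** Returns lowercased labels with score >= minScore, in their original order. */
    static List<String> tagsAbove(Map<String, Double> labelScores, double minScore) {
        List<String> tags = new ArrayList<>();
        for (Map.Entry<String, Double> entry : labelScores.entrySet()) {
            if (entry.getValue() >= minScore) {
                tags.add(entry.getKey().toLowerCase(Locale.ROOT));
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // Made-up labels and scores, standing in for a parsed API response.
        Map<String, Double> labels = new LinkedHashMap<>();
        labels.put("Shoe", 0.97);
        labels.put("Footwear", 0.91);
        labels.put("Carpet", 0.42);
        System.out.println(tagsAbove(labels, 0.80));
    }
}
```

A higher threshold gives fewer but more reliable tags; where to set it depends on how much noise users will tolerate in search results.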
As you can imagine, the ability to automatically classify and label images provides dozens of powerful use cases for Box customers. In our beta, we're working with companies across a number of industries:
A retail customer is using image recognition in Box to optimize digital asset management of product photos. With automatic object detection and metadata labels, they can cut out manual tagging and organization of critical images that are central to multi-channel processes.
A major media company is using image recognition in Box to automatically tag massive amounts of inbound photos from freelance photographers around the globe. Previously, there was no way they could preview and tag every single image. Now they can automatically analyze more images than ever before, and unlock new ways to derive value from that content.
A global real estate firm is leveraging optical character recognition in Box to digitize workflows for paper-based leases and agreements, allowing their employees to skip a manual tagging process while classifying sensitive assets more quickly.
We're excited to continue experimenting with GCP's APIs to help our customers get more out of their content in Box. You can learn more about this from our initial announcement.