
Open source and open data

There’s an ongoing debate about the value of data and whether internet companies should do more to share their data with others. At Google we’ve long believed that open data and open source are good not only for us and our industry, but also for the world at large.

Our commitment to open source and open data has led us to share datasets, services and software with everyone. For example, Google released the Open Images dataset of 36.5 million images containing nearly 20,000 categories of human-labeled objects. With this data, computer vision researchers can train image recognition systems. Similarly, the millions of annotated videos in the YouTube-8M collection can be used to train video recognition.

With respect to language processing, we’ve shared the Natural Questions database, which contains 307,373 human-generated questions and answers. We’ve also made available the Trillion Word Corpus, which is based on words used on public web pages, and the Ngram Viewer, which can be used to explore the more than 25 million books in Google Books. These collections can be used for statistical machine translation, speech recognition, spelling correction, entity detection, information extraction and other language research.

And these are only a few examples of a much broader activity: Google AI currently lists 62 datasets of this sort that we’re making available to the research community.

We also host a large number of publicly available datasets, such as the 20,000 Kaggle Open Datasets and the Cloud Public Datasets, which allow people to access frequently used public data directly from their workspace.

Google also offers Google Trends, a free service that enables anyone to see and download aggregate search activity since 2004 for Google Search, Image Search, News Search, Shopping and YouTube. You can get search information for countries, regions, metro areas and cities on a monthly, weekly, daily and even hourly basis. The Trends data is widely used by researchers in fields as varied as medicine and economics. According to Google Scholar, there are more than 21,000 research papers that cite Trends as a data source.

Google is also a major contributor to open source software. Key examples include Android, our smartphone operating system; Chromium, the code base for our Chrome browser (which now also powers many competing browsers); and TensorFlow, our machine learning system. Google’s release of Kubernetes changed cloud hosting forever, and has enabled innovation and competition across the cloud industry. Google is also the largest contributor of open source code to GitHub, a shared repository for software development. In 2017, Googlers made more than 250,000 changes to tens of thousands of projects on GitHub alone.

Finally, we’ve also released over 5,300 research reports written at Google, most of which have subsequently been published in scientific journals or conference proceedings.

Of course, it is costly to create and compile this data, software, and research. So why do we release these materials free of charge?

First and foremost, our primary mission is “to organize the world’s information and make it universally accessible and useful.” Certainly one obvious way to make information universally accessible and useful is to give it away!

Second, making these materials available stimulates scientific research outside of Google. We know we can’t do it all, and we spend a lot of time reading, understanding and often extending work done by others, some of which has been developed using tools and data we have provided to the research community. This mix of competition and cooperation among groups of researchers is what pushes science forward.

Third, when we hire new employees, it’s great if they can hit the ground running and already know and use the tools we have developed. Familiarity with our software and data makes engineers productive from their first day at work.

There are many more reasons to share research data, but these three alone justify the practice. We aren’t the only internet company to appreciate the power of open data, open code, and open research. Our colleagues in academia and many other companies follow the same practices for much the same reasons.

Of course, we can’t release all the data we use in our business. We need to protect user privacy, maintain confidentiality for business customers, and protect Google’s own intellectual property. But, subject to such considerations, we generally try to make our data as “universally accessible and useful” as possible.