
Have you met Google Dataset Search?

 5 years ago
source link: https://www.tuicool.com/articles/juUFzqf

A few weeks ago I wrote an article about the kinds of decisions we usually face when starting a new DS/ML project:

  • The topic
  • Level of difficulty
  • The goal of the project
  • Will we have an audience?
  • Timing and deadlines

These five points are important, of course, but apart from all that, if we don’t have any data, we don’t have a project. Data is king. If you don’t know how to get it, you have nothing. Beyond this, in my previous article I highlighted some important questions you should ask yourself about the source of the data, its format, and the actions you will need to take with it.

Now, what if I told you there’s a source where you can search thousands of datasets, and datasets only, from all around the world, in several formats, and easy to discover and access? And who better than Google to make something like this available? Welcome to Dataset Search: Google’s not-that-new, but still in beta, search engine ONLY for datasets.

Now, if you’re reading this and thinking “well, data is already available on the web, so this provides nothing new”, fair enough, that is true. But searching for data can be messy, and sometimes it is hard to tell good, true, complete, reliable data from bad, false, incomplete, dodgy data. That’s why, as Google said when launching the tool, Dataset Search was born to enable easy access to the many thousands of data repositories on the web, which together provide access to millions of datasets, as well as to the data published by local and national governments around the world.

Dataset Search is available on both mobile and desktop, and it already works in several languages, with more coming soon. There is no science to it beyond what it is: a search engine. If you know how to use Google’s main search engine, you’ll know how to use Dataset Search. Simply type whatever you’re looking for and Google will respond with as much information as the publisher has made available, including:

  • A direct link to the download page in the source
  • Who is providing the dataset
  • Available download formats
  • Time period covered
  • Area covered
  • Variables measured
  • And a brief description of the dataset

For example, I’m from Uruguay, so I searched for ‘uruguay internet penetration’ and got 36 results, a number that I personally find pretty impressive, considering we’re talking about a tiny country in South America. If we run the same search for the United Kingdom, Dataset Search already shows more than 100 results.


In an article published some time ago by The Verge, Jeni Tennison, CEO of the Open Data Institute (an institution which works with companies and governments to build an open, trustworthy data ecosystem for better decision making), said:

“Having Google involved should help make this project a success (…) Dataset search has always been a difficult thing to support, and I’m hopeful that Google stepping in will make it easier”

Of course, Dataset Search still has its limitations. As you can see as soon as you open it, it is still in beta. Even though Google has not given any update, features such as advanced search or filtering on things like data size or the number of columns are probably going to arrive sometime soon.

Now, if you’re reading this and you have a dataset published somewhere on the web, please read the following lines carefully: like any search engine, Dataset Search will get more and more useful as people make more and more datasets available. For that, Google is following an approach based on an open standard for describing information (schema.org), and anybody who publishes data can describe their dataset this way. Google has also defined some standards for describing data so that users can find it. So if you want your data to appear in Dataset Search, visit Google’s guidelines for dataset providers, as well as their developers site, which also includes a link to ask questions and provide feedback.
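To make the schema.org approach a bit more concrete, here is a minimal sketch in Python of how a dataset description could be generated as JSON-LD and wrapped in a script tag for the dataset’s web page. The property names (name, description, creator, distribution, temporalCoverage, spatialCoverage, variableMeasured) come from the schema.org Dataset type; the helper functions, the example dataset and every example.com URL are hypothetical and only there for illustration. Check Google’s guidelines for which properties are actually required or recommended before publishing.

```python
import json


def build_dataset_jsonld(name, description, url, creator, formats,
                         temporal_coverage=None, spatial_coverage=None,
                         variables=None):
    """Return a schema.org Dataset description as a JSON-LD string.

    `formats` maps a format label (e.g. "CSV") to its download URL.
    This helper is hypothetical; it is not part of any Google library.
    """
    doc = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "creator": {"@type": "Person", "name": creator},
        "distribution": [
            {"@type": "DataDownload",
             "encodingFormat": fmt,
             "contentUrl": link}
            for fmt, link in formats.items()
        ],
    }
    # Optional properties are only emitted when a value is supplied.
    if temporal_coverage:
        doc["temporalCoverage"] = temporal_coverage
    if spatial_coverage:
        doc["spatialCoverage"] = spatial_coverage
    if variables:
        doc["variableMeasured"] = list(variables)
    return json.dumps(doc, indent=2)


def wrap_in_script_tag(jsonld):
    """Wrap a JSON-LD string in the script tag used to embed it in HTML."""
    return '<script type="application/ld+json">\n' + jsonld + '\n</script>'


if __name__ == "__main__":
    # All values below are made up for the sake of the example.
    jsonld = build_dataset_jsonld(
        name="Football player statistics",
        description="Statistics about football players, collected to "
                    "experiment with predicting their market value.",
        url="https://example.com/datasets/football-players",
        creator="Jane Doe",
        formats={"CSV": "https://example.com/datasets/football-players.csv"},
        temporal_coverage="2018/2019",
        variables=["age", "goals", "market value"],
    )
    print(wrap_in_script_tag(jsonld))
```

The printed script tag would go into the HTML of the page that hosts the dataset, which is where a crawler can pick it up. Notice how the properties line up with what Dataset Search shows in its results: who provides the data, the download formats, the period and area covered, and the variables measured.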

Here are some examples of what can qualify as a dataset according to Google:

  • A table or a CSV file with some data
  • An organized collection of tables
  • A file in a proprietary format that contains data
  • A collection of files that together constitute some meaningful dataset
  • A data object in some other format to use with a special tool for processing
  • Images capturing data
  • Files relating to machine learning, such as trained parameters or neural network structure definitions

All in all, whether you’re an experienced researcher or an enthusiastic data scientist like me, you should be excited about this initiative and willing to spread the word and help grow the number of datasets currently available. This can only be good, not just for the data science community but for anyone looking for a dataset. I’m already working on publishing some of my data, especially a dataset with statistics about hundreds of football players that I built some time ago when I was trying to predict their market value. Meanwhile, feel free to check my GitHub repository and download that data, or any other you might find :)

Don’t forget to check out some of my latest articles, such as 6 amateur mistakes I’ve made working with train-test splits and Web scraping in 5 minutes, or any of the others available on my profile. And if you liked this article, don’t forget to follow me. If you want to receive my latest articles directly in your inbox, just subscribe to my newsletter.

