In today’s world, artificial intelligence (AI) is seen as a double-edged sword. On one side, there are smarter homes, improved health technology, and the prospect of driverless vans delivering groceries. On the other side, there are privacy violations, discrimination, and adverse effects of the technology that have not yet been discovered.
Many of the risks in AI stem from data difficulties, including the challenge of ingesting high-quality data before any sorting, linking, or programming takes place. This article looks at 15 sources of machine learning datasets.
1) Google Open Images
Google Open Images is a dataset of ~9 million image URLs that have been annotated with labels spanning over 6,000 categories. The team at Google tried to make the dataset as practical as possible, so the labels cover more real-life entities than the 1,000 ImageNet classes.
The image-level annotations were populated automatically by a vision model similar to the Google Cloud Vision API. The dataset is the product of a collaboration between Google, CMU, and Cornell University.
2) ImageNet
ImageNet is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a “synonym set” or “synset”. WordNet contains more than 100,000 synsets, most of them nouns (80,000+). The images for each concept are quality controlled and human-annotated.
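The synset idea can be sketched as a small data structure. This is only a toy illustration: the `Synset` class, its identifiers, and the three example concepts are hypothetical and are not taken from WordNet or ImageNet.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Synset:
    """A toy synonym set: several words that name one concept."""
    synset_id: str                      # hypothetical ID, not a real WordNet ID
    lemmas: List[str]                   # the words or phrases for this concept
    parent: Optional["Synset"] = None   # the more general concept, if any
    image_urls: List[str] = field(default_factory=list)

    def path_to_root(self) -> List[str]:
        """Walk up the hierarchy, returning the first lemma of each concept."""
        node, path = self, []
        while node is not None:
            path.append(node.lemmas[0])
            node = node.parent
        return path


# A made-up three-level hierarchy.
animal = Synset("toy-0001", ["animal", "creature"])
dog = Synset("toy-0002", ["dog", "domestic dog"], parent=animal)
puppy = Synset("toy-0003", ["puppy"], parent=dog)

print(puppy.path_to_root())
```

In a layout like this, images attach to a concept rather than to a single word, so "dog" and "domestic dog" would share the same set of images.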
3) Waymo Open Dataset
The Waymo Open Dataset includes high-resolution sensor data collected by Waymo self-driving cars in a wide variety of conditions. The dataset comprises lidar and camera data from around 1,000 segments of 20 s each, collected at 10 Hz in different geographies and conditions.
The sensor data comes from 1 mid-range lidar, 4 short-range lidars, and 5 cameras, and includes synchronized lidar and camera data, lidar-to-camera projections, sensor calibrations, and vehicle poses. The labelled data covers 4 object classes, with high-quality labels for lidar data in each segment and 12M 3D bounding box labels.
Here is the GitHub link to the Waymo Open Dataset
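To get a feel for the scale implied by the Waymo figures quoted above, the arithmetic can be worked through in a few lines. This is a rough sketch: it treats the approximate values ("around 1,000 segments", "12M" labels) as exact and assumes every frame carries an image from all 5 cameras, which the article does not state.

```python
# Figures quoted in the article, treated as exact for this estimate.
SEGMENTS = 1000            # "around 1,000 segments"
SEGMENT_SECONDS = 20       # each segment lasts 20 s
SAMPLE_RATE_HZ = 10        # data collected at 10 Hz
CAMERAS = 5
BOX_LABELS_3D = 12_000_000

frames_per_segment = SEGMENT_SECONDS * SAMPLE_RATE_HZ
total_frames = SEGMENTS * frames_per_segment

# Assumption: one image per camera per frame.
camera_images = total_frames * CAMERAS

# Average number of 3D boxes per frame under the assumptions above.
avg_boxes_per_frame = BOX_LABELS_3D / total_frames

print(f"frames per segment: {frames_per_segment}")
print(f"total frames:       {total_frames}")
print(f"camera images:      {camera_images}")
print(f"avg 3D boxes/frame: {avg_boxes_per_frame:.0f}")
```

Under these assumptions each segment holds 200 frames, which gives roughly 200,000 frames and about 60 labelled 3D boxes per frame on average.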
4) UCI Machine Learning Repository
The UCI Machine Learning Repository hosts hundreds of datasets and is maintained by the University of California, Irvine, School of Information and Computer Science. The repository categorizes datasets by type of machine learning problem. Users can find datasets for univariate and multivariate time series, classification, regression, and recommendation systems.
Here is the GitHub link to the UCI Machine Learning Repository
5) xView
xView is one of the largest publicly available datasets of overhead imagery. It contains images of complex scenes from around the world, annotated with bounding boxes. The DIUx xView 2018 Detection Challenge focuses on accelerating progress in four computer vision frontiers: reducing the minimum resolution for detection, improving learning efficiency, enabling the discovery of more object classes, and improving detection of fine-grained classes.
Here is the GitHub link to the xView Dataset
6) MS COCO
COCO is a large-scale object detection, segmentation, and captioning dataset. Its features include object segmentation, recognition in context, 80 object categories, and 5 captions per image, among many others.
Here is the GitHub link to the MS COCO Dataset.
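As an illustration of how object annotations of this kind are typically consumed, the sketch below groups labelled boxes by image. The sample record is made up, and its layout (separate `images`, `categories`, and `annotations` lists, with `bbox` given as `[x, y, width, height]`) is an assumption based on the commonly used COCO-style JSON format rather than anything stated in this article; check the dataset's own documentation before relying on it.

```python
import json
from collections import defaultdict

# Made-up sample in the style of a COCO annotation file; not real COCO data.
SAMPLE_JSON = """
{
  "images": [{"id": 1, "file_name": "kitchen.jpg", "width": 640, "height": 480}],
  "categories": [{"id": 1, "name": "person"}, {"id": 2, "name": "cup"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [50, 40, 200, 400]},
    {"id": 11, "image_id": 1, "category_id": 2, "bbox": [300, 220, 40, 60]}
  ]
}
"""


def objects_per_image(dataset):
    """Map each image file name to a list of (category name, bbox) pairs."""
    names = {c["id"]: c["name"] for c in dataset["categories"]}
    files = {i["id"]: i["file_name"] for i in dataset["images"]}
    grouped = defaultdict(list)
    for ann in dataset["annotations"]:
        # Assumed bbox layout: [x, y, width, height] in pixels.
        grouped[files[ann["image_id"]]].append(
            (names[ann["category_id"]], ann["bbox"])
        )
    return dict(grouped)


dataset = json.loads(SAMPLE_JSON)
result = objects_per_image(dataset)
print(result)
```

Keeping images, categories, and annotations in separate lists linked by IDs means that one image can carry many objects without repeating its metadata.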
7) Visual Genome
Visual Genome is a dataset and knowledge base, part of an ongoing effort to connect structured image concepts to language.
Here is the GitHub link to the Visual Genome Dataset