Completing your first data science project is a major milestone on the road to becoming a data scientist. It helps to reinforce your skills and gives you something to discuss during the interview process. It’s also an intimidating process. The first step is to find an appropriate, interesting dataset, and to decide how large and how messy a dataset to work with. While cleaning data is an integral part of data science, you may want to start with a clean dataset for your first project so that you can focus on the analysis rather than on cleaning the data.
1. U.S. Census Bureau
The U.S. Census Bureau publishes reams of demographic data at the state, city, and even zip code level. It is a fantastic dataset for students interested in creating geographic data visualizations and can be accessed on the Census Bureau website. Alternatively, the data can be accessed via an API. One convenient way to use that API is through the choroplethr R package. In general, this data is very clean, comprehensive, and nuanced, which makes it a good choice for data visualization projects because you do not have to clean it manually.
2. FBI Crime Data
The FBI crime data is fascinating and one of the most
interesting data sets on this list. If you’re interested in analyzing time
series data, you can use it to chart changes in crime rates at the national
level over a 20-year period. Alternatively, you can
look at the data geographically.
3. CDC Cause of Death
The Centers for Disease Control and Prevention maintains
a database on cause of death. The data can be segmented
in almost every way imaginable: age, race, year, and so on. Since this is
such a massive data set, it’s good to use for data processing projects.
4. Medicare Hospital Quality
The Centers for Medicare & Medicaid Services
maintains a database on quality of care at more than 4,000 Medicare-certified
hospitals across the U.S., providing for interesting comparisons. Since
this data will be spread over multiple files and might take a bit of research
to fully understand, this could be a good data cleaning project.
5. SEER Cancer Incidence
The U.S. government also has data about cancer incidence, again
segmented by age, race, gender, year, and other factors. It comes from the
National Cancer Institute’s Surveillance, Epidemiology, and End Results
Program. The data goes back to 1975 and has 18 databases, so you’ll have
plenty of options for analysis.
6. Bureau of Labor Statistics
Many important economic indicators for the United States
(like unemployment and inflation) can be found on the Bureau of Labor
Statistics website. Most of the data can be segmented both by time
and by geography. This large dataset can be used for data processing and data
visualization projects.
7. Bureau of Economic Analysis
The Bureau of Economic Analysis also has
national and regional economic data, including gross domestic product and
exchange rates. There’s a huge range in the different groups of data found
here—you can browse by place, economic accounts, and topics—and these groups
are organized into even smaller subsets throughout.
8. International Monetary Fund
For access to global financial statistics and other data,
check out the International Monetary Fund’s website. There
are a few different sets here, so you can use them for a wide range of projects
like visualization or even cleaning.
9. Dow Jones Weekly Returns
Predicting stock prices is a major application of data
analysis and machine learning. One relevant dataset to explore is the weekly returns of the Dow Jones Index from the Center for
Machine Learning and Intelligent Systems at the University of California,
Irvine. This is one of the sets specially made for machine learning
projects.
10. Data.gov.uk
The British government’s official data
portal offers access to tens of thousands of datasets on topics
such as crime, education, transportation, and health. Since this is an
open data source with millions of entries, you’ll be able to practice data
cleaning across different groupings.
11. Enron Emails
After the collapse of Enron, a free dataset of
roughly 500,000 emails
with message text and metadata was released.
The dataset is now famous and provides an excellent testing ground
for text-related analysis. You can also
explore other research uses of this dataset through its page.
12. Google Ngram Viewer
If you’re interested in truly massive data, the Ngram viewer dataset counts the frequency of words and
phrases by year across a huge number of text sources. The resulting file is 2.2
TB! While this might be difficult to use for a visualization project, it’s an
excellent dataset for cleaning as it’s nuanced and will require additional
research.
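A file of that size cannot be loaded into memory at once, so one common approach is to read it line by line and keep only running totals. The sketch below assumes a tab-separated layout of n-gram, year, match count, and volume count per line; that layout, the function name `yearly_counts`, and the sample rows are illustrative assumptions, so check the format of the file you actually download.

```python
import io
from collections import defaultdict


def yearly_counts(lines, target):
    """Sum match counts per year for one n-gram, reading one line at a time."""
    totals = defaultdict(int)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 4:
            continue  # skip rows that do not match the assumed layout
        ngram, year, match_count, _volume_count = parts
        if ngram == target:
            totals[int(year)] += int(match_count)
    return dict(totals)


# Tiny in-memory stand-in for a real file (made-up rows, assumed format).
sample = io.StringIO(
    "data\t1999\t10\t3\n"
    "data\t2000\t15\t4\n"
    "science\t2000\t7\t2\n"
    "data\t2000\t5\t1\n"
    "bad row\n"
)

counts = yearly_counts(sample, "data")
print(counts)
```

With a real download you would pass an open file handle instead of the `io.StringIO` sample. Because the loop never holds more than one line plus the totals, memory use stays small regardless of file size.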
13. UNICEF
If data about the lives of children around the world is
of interest, UNICEF is the most credible source. The organization’s public
data sets touch upon nutrition, immunization, and education, among others,
making for a great resource for visualization projects.
14. Reddit Comments
Reddit released a really interesting dataset of every comment that has ever been made
on the site. It’s over a terabyte of data uncompressed, so if you want a
smaller dataset to work with, Kaggle hosts the comments from May 2015 on its site.
15. Wikipedia
16. Lending Club
Lending Club provides data about loan applications it has
rejected as well as the performance of loans that it has issued. The free
dataset lends itself both to classification techniques (will a given loan
default?) and to regression (how much will be paid back on a given loan?).
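To make the two framings concrete, here is a minimal sketch of how one loan record can yield either a yes/no label or a continuous target. The `Loan` fields, the status strings, and the sample values are hypothetical and are not the actual Lending Club column names.

```python
from dataclasses import dataclass


@dataclass
class Loan:
    # Hypothetical fields; the real dataset uses its own column names.
    amount_funded: float
    total_repaid: float
    status: str


def default_label(loan: Loan) -> int:
    """Classification target: 1 if the loan defaulted, else 0."""
    return 1 if loan.status == "Charged Off" else 0


def repaid_fraction(loan: Loan) -> float:
    """Regression target: amount repaid as a fraction of the amount funded."""
    return loan.total_repaid / loan.amount_funded


# Made-up example records.
loans = [
    Loan(amount_funded=10000, total_repaid=11200, status="Fully Paid"),
    Loan(amount_funded=5000, total_repaid=1500, status="Charged Off"),
]

labels = [default_label(loan) for loan in loans]
fractions = [repaid_fraction(loan) for loan in loans]
print(labels, fractions)
```

A classifier would be trained to predict `labels`, while a regression model would be trained to predict `fractions`. The input features can be the same in both cases.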
17. Walmart
Walmart has released historical sales data for 45 stores located in
different regions across the United States. This offers a huge set of data
to read and analyze, and many different questions to ask about it—making for a
solid resource for data processing projects.
18. Airbnb
Inside Airbnb offers different datasets related to Airbnb listings in dozens of
cities around the world. This dataset, given its specificity to the travel
industry, is great for practicing your visualization skills.
19. Yelp
20. Google Trends
Google Trends has one of the most interesting datasets to
analyze. While we’re using “e-learning” in this example, you can explore
different search terms and go as far back as 2004. All you have to do is download the dataset as a CSV file to
analyze the data outside of the Google Trends webpage. You can download data on
interest levels for a given search term, interest by location, related topics,
categories, search types (video, images, etc.), and more! Google also lists
a large collection of publicly available datasets on the Google Public
Data Explorer. Make sure to check it out!
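Once you have the CSV export described above, a few lines of Python are enough to load it. The sketch below assumes the file has a short preamble, then a two-column header followed by one row per period, with very low values written as `<1`. That layout, the function name `read_interest_over_time`, and the sample numbers are assumptions for illustration, so inspect your own download before relying on it.

```python
import csv
import io


def read_interest_over_time(text):
    """Return (column label, [(period, interest), ...]) from an exported CSV."""
    # Keep only rows with at least two columns; this drops preamble and blank lines.
    rows = [row for row in csv.reader(io.StringIO(text)) if len(row) >= 2]
    header, data = rows[0], rows[1:]
    series = []
    for period, value in (row[:2] for row in data):
        # Treat "<1" style entries as 0 so every value is an integer.
        series.append((period, 0 if value.startswith("<") else int(value)))
    return header[1], series


# Made-up sample in the assumed export layout.
sample = """Category: All categories

Month,e-learning: (Worldwide)
2004-01,100
2004-02,<1
2004-03,87
"""

term, series = read_interest_over_time(sample)
peak = max(series, key=lambda point: point[1])
print(term, peak)
```

For a real file, read its contents with `open(path, encoding="utf-8").read()` and pass the resulting string to the function.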
21. World Trade Organization
For students looking to learn through analysis, the World Trade Organization offers many downloadable datasets that
give insight into trade flows and predictions. Those with a knack for
business insights will particularly appreciate this dataset, as it
provides tons of opportunities not only to get into data science but also
to deepen your understanding of the trading industry.
22. International Monetary Fund
This site has several
free Excel datasets for download on different key economic
indicators, from gross domestic product (GDP) to inflation. Taking the data
from multiple files and condensing it for clarity and patterns is an excellent
(and satisfying!) way to practice data cleaning.
23. U.S. Energy Information Administration Open Data
This source has free and open
data that is available as a bulk file, in Excel via an
add-in, in Google Sheets via an add-on, and via widgets that embed interactive
data visualizations of EIA data on any website. The website also notes that
the EIA data is
available in machine-readable formats, making it a great resource
for machine learning projects.
24. TensorFlow Image Dataset: CelebA
For practice with machine learning, you’ll need a
specialized dataset such as those offered by TensorFlow. The TensorFlow library includes all
sorts of tools, models, and machine learning guides along with its
datasets. CelebA is an extremely large dataset that is publicly available online and
contains over 200,000 celebrity images.
25. TensorFlow Text Dataset: C4
Another TensorFlow set is C4: Common Crawl’s
Web Crawl Corpus. Available in 40+ languages, this open-source
repository of web page data spans seven years of data, making for an excellent
resource for machine learning dataset practice.
26. Our World in Data
Our World In Data is an interesting case study in open
data. Not only can you find the underlying public data sets, but visualizations
are already presented in order to slice up the data. The site mainly deals
with large-scale country-by-country comparisons on important statistical
trends, from the rate of literacy to economic progress.
27. Cryptodatadownload
Do you want some insight into the emergence of
cryptocurrencies? Cryptodatadownload offers free public data sets of
cryptocurrency exchanges and historical data that tracks the exchanges and
prices of cryptocurrencies. Use it to do historical analyses or to see
whether you can predict the madness.
28. Kaggle Data
Kaggle datasets are an aggregation of user-submitted
and curated datasets.
It’s a bit like Reddit for datasets, with rich tooling to get started with different
datasets, comment and upvote functionality, and a view of which
projects are already being worked on in Kaggle. It’s a great all-around resource for
a variety of open datasets across many domains.
29. GitHub Collection (Open Data)
GitHub is the central hub of open data and open-source
code. With different open datasets that are hosted on GitHub itself
(including data on every member of Congress from 1789 onwards and data on food
inspections in Chicago), this collection lets you get familiar with GitHub and
the vast amount of open data that resides on it.
30. GitHub (Awesome Public Data sets)
The Awesome collection of repositories on GitHub is a
user-contributed collection of resources. In this case, the repository contains a variety of open data sources
categorized across different domains. Use this resource to find different open
datasets—and contribute back to it if you can.
31. Microsoft Azure Open Datasets
Microsoft Azure is the cloud solution provided by
Microsoft, which offers a variety of open public data sets that are connected to its
Azure services. You can access featured datasets on everything from weather to
satellite imagery.
32. Google BigQuery
Google BigQuery is Google’s cloud solution for processing
large datasets in a SQL-like manner. You can preview these very large
public data sets through the subreddit
wiki dedicated to BigQuery, which covers everything from very rich
Wikipedia data to datasets dedicated to cancer genomics.
33. SafeGraph Data
SafeGraph is a popular source for all things location
data. While their data is not free to everyone, academics can download the data for free for locations in
the U.S., Canada, and the UK via the SafeGraph Shop.
This data is great for economists, social scientists, public health researchers, and anyone who is interested in knowing where a location is and how people move between locations. It appears to be popular: SafeGraph data has been used in over 600 academic papers.