At this point, we have seen what our data looks like, how it is stored, and what some basic tests might look like. In this post, we start to look at how this might be created as a report to aid in the ETL process.
Many companies find themselves in a position where a CSV (or something similar) is delivered from outside their organization. In this example, we assume the file is being placed into a folder called “new_data”. Our code picks up the file and compares it against a set of expectations in order to decide whether or not to move forward with the ETL process.
This Jupyter notebook could be run each time the file is updated, and its output could be sent to stakeholders before the data is processed. It contains a very basic level of testing and visualization, but the idea should get you started. When it runs, tests confirm whether or not the data fits within certain constraints and passes some integrity tests. The data is then plotted, and a final output at the bottom shows which tests passed or failed.
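A minimal sketch of such a validation step might look like the following. The file path, column names, and constraints here are all hypothetical; in practice you would tailor the checks to whatever your upstream provider delivers:

```python
import pandas as pd

def validate(df):
    """Run basic integrity tests; return {test_name: passed} (checks are illustrative)."""
    results = {}
    results["no_missing_ids"] = df["id"].notna().all()
    results["amount_non_negative"] = (df["amount"] >= 0).all()
    results["no_duplicate_ids"] = not df["id"].duplicated().any()
    return results

# In the notebook, the file would come from the "new_data" folder, e.g.:
# df = pd.read_csv("new_data/latest.csv")
df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 5.5, 0.0]})

results = validate(df)
for test, passed in results.items():
    print(f"{test}: {'PASS' if passed else 'FAIL'}")

# Only move forward with the ETL step if everything passed
proceed = all(results.values())
```

The final `proceed` flag is the decision point: a notebook cell (or an orchestrator reading the notebook's output) can halt the pipeline when any test fails.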
Fertility is something people don’t typically discuss openly in the US, which isn’t a surprise because it is an incredibly personal topic. In fact, it’s difficult even to write a blog post about; I wrote this over a year ago and I’m only getting around to posting it now. It took us roughly 7 months to conceive a baby, and I’m proud to say we now have a happy baby boy!
However, every negative pregnancy test you see takes an emotional toll on you (and can even put strain on some marriages). During that time, I found that research online wasn’t extremely helpful. My wife and I found it relatively difficult to find answers to two very important questions:
What are the odds of a couple conceiving each month?
How much of a factor does age play?
I need to start this off by saying that I am not a doctor (nor do I play one on TV). In fact, I’m going to start my exploration of this topic by first reading some blogs about it. This isn’t typically a great option, but then again, I’m writing a blog as well… What could go wrong with a blog based on other blogs, which might itself be discussed in yet another blog? I digress.
Recently, I started looking into data sets to compete in Go Code Colorado (check it out if you live in CO). The problem with such diversity in data sets is finding a way to quickly visualize the data and do exploratory analysis. While tools like Tableau make data visualization extremely easy, the data isn’t always formatted properly to be easily consumed. Here are a few tips to help speed up your exploratory data analysis!
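One of the most common formatting problems is wide data, with one column per year or category, when a tool like Tableau works best with long data. A quick pandas `melt` fixes that; the column names and values below are hypothetical:

```python
import pandas as pd

# Wide format: one column per year (hypothetical example data)
wide = pd.DataFrame({
    "county": ["Denver", "Boulder"],
    "2015": [100, 40],
    "2016": [110, 45],
})

# Long format: one row per county/year pair, which is much easier
# for visualization tools to consume
long = wide.melt(id_vars="county", var_name="year", value_name="count")
print(long)
```

After the reshape, each row is a single observation, so "year" can be dropped straight onto an axis or filter shelf.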
We’ll use data from two sources to aid with this example:
Is George Washington better looking on the dollar bill or represented by a word cloud built with the text of The Constitution of the USA?
A colleague recently asked me that exact question. If you want to be taken seriously in the data science world, you better be able to answer something like this!
I decided that it would be fun to show off a Python package by Andreas Mueller called word_cloud (here) to make a fun image with the text of the Constitution and an image of one of the Founding Fathers.
I must warn you, word clouds are like pie charts: people like the way they look, but they don’t provide much information. That said, this package is really neat because it allows you to easily turn text into images utilizing masks, colors, and numpy!
I’ll keep this post short; what you want to do is simple:
Select an image which you would like to mimic in both color and shape
Read your image into Python using numpy
Read your text into Python using open() and read()
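The steps above can be sketched as follows. The file paths are hypothetical, and a stand-in mask and snippet of text are used so the sketch runs without the real image or the full text of the Constitution; swap in your own files where the comments indicate:

```python
import numpy as np

# Steps 2 and 3: read the image and the text (paths are hypothetical)
# from PIL import Image
# mask = np.array(Image.open("washington.png"))
# text = open("constitution.txt").read()

# Stand-ins so this sketch runs without the actual files.
# In a wordcloud mask, pure white (255) areas are left blank,
# so words are drawn only inside the dark square below.
mask = np.full((200, 200, 3), 255, dtype=np.uint8)
mask[50:150, 50:150] = 0
text = "we the people of the united states in order to form a more perfect union"

try:
    from wordcloud import WordCloud, ImageColorGenerator
    wc = WordCloud(mask=mask, background_color="white").generate(text)
    wc.recolor(color_func=ImageColorGenerator(mask))  # color words from the image
    wc.to_file("washington_cloud.png")
except ImportError:
    print("install the package first: pip install wordcloud")
```

With a real photo as the mask, `ImageColorGenerator` pulls each word’s color from the underlying pixels, which is what makes the portrait effect work.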
The disastrous impact of the recent hurricanes Harvey and Irma generated a large influx of data within the online community. I was curious about the history of hurricanes and tropical storms, so I found a data set on data.world and started some basic exploratory data analysis (EDA).
EDA is crucial to starting any project. Through EDA you can start to identify errors and inconsistencies in your data, find interesting patterns, see correlations, and begin to develop hypotheses to test. For most people, basic spreadsheets and charts are handy and provide a great place to start. They are an easy-to-use way to manipulate and visualize your data quickly. Data scientists may cringe at the idea of using a graphical user interface (GUI) to kick off the EDA process, but those tools are very effective and efficient when used properly. However, if you’re reading this, you’re probably trying to take EDA to the next level. The best way to learn is to get your hands dirty, so let’s get started.
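A first EDA pass usually amounts to a handful of one-liners: check the shape and types, count missing values, and look for suspicious entries. The tiny DataFrame below is a made-up stand-in for the hurricane data; the real file would be loaded from data.world with `pd.read_csv`:

```python
import pandas as pd

# Hypothetical slice of a storm data set; in practice you would load the
# real file, e.g. df = pd.read_csv("atlantic_storms.csv")
df = pd.DataFrame({
    "name": ["Harvey", "Irma", "Katrina", None],
    "year": [2017, 2017, 2005, 1992],
    "max_wind_kt": [115, 155, 150, -99],  # -99 is a common missing-value sentinel
})

# Shape, dtypes, and non-null counts
df.info()

# Summary statistics often expose sentinel values like -99 immediately
print(df.describe())

# Explicit checks for errors and inconsistencies
print("missing names:", df["name"].isna().sum())
print("impossible wind speeds:", (df["max_wind_kt"] < 0).sum())
```

Spotting the `-99` sentinel in `describe()`'s minimum is exactly the kind of inconsistency this first pass is meant to surface, before any charts are drawn.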