Initial data acquisition and data analysis
To get an idea of what our data looks like, we need to look at it! The Jupyter Notebook embedded below shows the steps to load your data into Python and compute some basic statistics, which we then use to identify potential issues with new data as it arrives.
This process is simply the exploratory step; we will build part of the pipeline in the next step. It’s important to bring notebooks in once in a while to make sure we know what we’re looking at.
Keep in mind, this is the first look at the data and we’re only running some very basic tests. These tests will become more robust and meaningful as we continue to build out this pipeline.
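As a rough illustration of the idea (not the notebook's actual code), the sketch below computes baseline statistics from data we have already looked at and uses them to flag suspicious values in a new batch. The sample numbers, the function names `summarize` and `flag_outliers`, and the three-standard-deviation threshold are all assumptions made for this example. It uses only the standard library so it runs anywhere.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def flag_outliers(new_values, baseline, n_sigma=3.0):
    """Return the new values that fall outside mean +/- n_sigma * stdev."""
    low = baseline["mean"] - n_sigma * baseline["stdev"]
    high = baseline["mean"] + n_sigma * baseline["stdev"]
    return [v for v in new_values if not (low <= v <= high)]


# Hypothetical historical readings we have already explored.
historical = [10.0, 12.0, 11.0, 13.0, 9.0, 10.0, 12.0, 11.0]
baseline = summarize(historical)

# Hypothetical new batch; one value is clearly out of line.
new_batch = [11.0, 10.5, 42.0, 12.5]
suspicious = flag_outliers(new_batch, baseline)

print(baseline)
print(suspicious)
```

A fixed threshold like this is only a starting point. Once you know how the data is distributed, a check tailored to it will usually serve better.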
ETL (Extract, Transform, Load) is not always the favorite part of a data scientist’s job, but it’s an absolute necessity in the real world. If you don’t understand this process, you will have a basic grasp of it by the time you’re done with these lessons. I will be covering:
- Data exploration
  - Understanding your data
  - Looking for red flags
  - Utilizing both statistics and data visualization
- Checking your data for issues
  - Identifying things outside of the “normal” range
  - Deciding what to do with NaN or missing values
  - Discovering data with the wrong data type
- How to clean and transform your data
  - Utilize the pandas library
  - Utilize pyjanitor
  - Getting data into tidy format
- Dealing with your database
  - Determining whether or not you actually need a database
  - Choosing the right database
  - Deciding between relational and NoSQL
  - Basic schema design and normalization
  - Using an ORM – SQLAlchemy to insert data
- Building a data pipeline
  - Separate your ETL into parts
  - Utilize luigi to keep you on track
  - Error monitoring
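To make the outline above a little more concrete, here is a minimal sketch of an ETL job split into three separate steps, with the transform step checking for missing values, wrong data types, and values outside a "normal" range. Everything in it is an assumption for illustration: the sample CSV, the `people` table, the function names, and the 0 to 120 age range. It also uses the standard library's `csv` and `sqlite3` modules as stand-ins for the tools named in the list (pandas, SQLAlchemy, luigi), so it shows the shape of the process rather than those libraries' APIs.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; a real pipeline would read from a file or an API.
RAW_CSV = """name,age,city
Ada,36,London
Bob,,Denver
Cy,abc,Boulder
Dee,212,Austin
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows, min_age=0, max_age=120):
    """Transform: convert types and separate clean rows from problem rows."""
    clean, problems = [], []
    for row in rows:
        raw_age = row["age"].strip()
        if raw_age == "":
            problems.append((row["name"], "missing age"))
            continue
        try:
            age = int(raw_age)
        except ValueError:
            problems.append((row["name"], "age is not an integer"))
            continue
        if not (min_age <= age <= max_age):
            problems.append((row["name"], "age outside expected range"))
            continue
        clean.append((row["name"], age, row["city"]))
    return clean, problems


def load(rows, connection):
    """Load: insert the clean rows into a database table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER, city TEXT)"
    )
    connection.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")  # throwaway in-memory database
clean_rows, problem_rows = transform(extract(RAW_CSV))
load(clean_rows, connection)
loaded_count = connection.execute("SELECT COUNT(*) FROM people").fetchone()[0]

print(loaded_count)
print(problem_rows)
```

Keeping the three steps as separate functions means each one can be tested and rerun on its own, and the list of problem rows gives error monitoring something to report on.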
Stoltzmaniac Fans – It’s time for a #100DaysOfCode update.
I have completed 11 days of the challenge. Let me tell you, it has been a blast and I have already learned a lot. In this post I’ll walk you through what I’ve done thus far. Here is a link to the code on my GitHub repository.
As you may recall from my previous post, I set out to create a Flask application to host data science projects for the Meetup group that I organize (Fort Collins Data Science Meetup). My goal is to provide people with an outlet to run code online where they will get the benefits of having a server and a dynamic UI. This will improve the group’s collaboration and Git skills, along with allowing people to showcase their work without having to build infrastructure. In case you’re wondering, I built this using Docker Compose, Flask, NGINX, PostgreSQL, and MongoDB.
In order to keep from boring myself to sleep while writing this, I’m going to keep it short and to the point. You might be asking, “What does this application look like?” That’s a great question. It’s a normal website where people contribute Python scripts to do some sort of data processing or analysis. For example, here’s a word cloud generator where the user enters a Twitter handle along with a link to a logo of some sort, and then a word cloud is created from all of the most recent tweets! Here is @realdonaldtrump as the Republican elephant and @barackobama as the Democratic donkey.
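A word cloud sizes each word by how often it appears, so the core of a generator like this is a word count. The sketch below is not the code from the repository. It is a minimal, assumed version of that counting step using made-up sample text, a tiny hypothetical stop-word list, and a hypothetical `word_frequencies` function. Fetching tweets and drawing the cloud inside a logo shape are left out.

```python
import re
from collections import Counter

# Tiny, hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "is", "to", "of", "in", "for", "on"}


def word_frequencies(texts, top_n=5):
    """Count words across texts, ignoring URLs, @mentions, and stop words."""
    words = []
    for text in texts:
        # Lowercase, then blank out links and mentions before tokenizing.
        cleaned = re.sub(r"https?://\S+|@\w+", " ", text.lower())
        words.extend(
            w for w in re.findall(r"[a-z']+", cleaned) if w not in STOP_WORDS
        )
    return Counter(words).most_common(top_n)


# Made-up sample text standing in for a user's recent tweets.
sample_tweets = [
    "Data science is fun and data is everywhere",
    "Join the data science meetup in Fort Collins https://example.com",
    "Thanks @someone for the great meetup",
]
top_words = word_frequencies(sample_tweets, top_n=3)
print(top_words)
```

The resulting word and count pairs are what a plotting library would turn into font sizes.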
Starting the 100 Days of Code ( #100DaysOfCode ) challenge
I am always looking to boost my coding skills, and as I watched everyone make resolutions for the year, I couldn’t help but think I should try this challenge. In case you don’t know what I’m referring to, one resource is https://www.100daysofcode.com/ – which gives you a good overview of what the challenge involves.
What will I be building?
I am a project-oriented person, so I will be building a web application that runs sentiment analysis on text data from APIs.
The basic topics I hope to cover:
- Flask Application
- User login
- Store data from external APIs
- Utilize PostgreSQL and MongoDB
- Jinja2 templating
- Back end API development
- Luigi ETL pipeline
I will try to send out a blog update every week or two with highlights! I will also be updating GitHub as I go along. Part of the challenge is also posting on Twitter, so each day I’ll be using the hashtag #100DaysOfCode, and you can follow me @stoltzmaniac.
Fertility is something people don’t typically discuss openly in the US, which isn’t a surprise because it is an incredibly personal topic. In fact, it’s really difficult to even write a blog post about: I wrote this over a year ago and I’m only getting around to posting it now. It took us roughly 7 months to conceive a baby, and I’m proud to say we now have a happy baby boy!
However, every negative pregnancy test you see takes an emotional toll on you (and can even put strain on some marriages). During that time, I found that research online wasn’t extremely helpful. My wife and I found it relatively difficult to find answers to two very important questions:
- What are the odds of a couple conceiving each month?
- How much of a factor does age play?
I need to start this off by saying I am not a doctor (nor do I play one on TV). In fact, I’m just going to start my exploration of this topic by first reading some blogs on it. This isn’t typically a great option, but then again, I’m writing a blog as well… What could go wrong with a blog based on other blogs, which might be discussed in yet another blog? I digress.