Is George Washington better looking on the dollar bill or represented by a word cloud built with the text of The Constitution of the USA?
A colleague recently asked me that exact question. If you want to be taken seriously in the data science world, you better be able to answer something like this!
I decided it would be fun to show off word_cloud, a Python package by Andreas Mueller, by making an image from the text of the Constitution and a picture of one of the Founding Fathers.
I must warn you: word clouds are like pie charts. People like the way they look, but they don’t provide much information. That said, this package is really neat because it allows you to easily turn text into images utilizing masks, colors, and
I’ll keep this post short. What you want to do is simple:
- Select an image which you would like to mimic in both color and shape
- Read your image into Python using numpy
- Read your text into Python using open() and read()
- Make your word cloud!
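The four steps above can be sketched roughly as follows. This is a minimal sketch, not the exact script from this post: the helper names read_text and make_cloud and the out_path argument are made up for illustration, and it assumes the numpy, Pillow, and wordcloud packages are installed.

```python
from pathlib import Path


def read_text(path):
    """Step 3: read the whole text file into one string."""
    return Path(path).read_text(encoding="utf-8")


def make_cloud(text, image_path, out_path):
    """Steps 2 and 4: load the image, then build and save a masked, recolored cloud."""
    # Third-party imports live here so read_text() works without them installed.
    import numpy as np
    from PIL import Image
    from wordcloud import WordCloud, ImageColorGenerator

    # Step 2: the image becomes a numpy array used as both mask and palette.
    image = np.array(Image.open(image_path))

    # Step 4: pure-white pixels in the mask are left empty by word_cloud.
    wc = WordCloud(background_color="white", mask=image).generate(text)
    # Recolor each word with the image colors underneath it.
    wc.recolor(color_func=ImageColorGenerator(image))
    wc.to_file(out_path)
    return wc
```

With the files used later in this post, a call would look like make_cloud(read_text('const.txt'), 'george.jpg', 'cloud.png'), where 'cloud.png' is just a placeholder output name.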
In my code, I also added some fun little stuff from Python’s
nltk (Natural Language Toolkit) library, which is heavily used in Natural Language Processing. You’ll find some basic text manipulation techniques to tokenize your data, remove stop words, and find the most commonly used words. Eventually, I’ll get around to writing some posts on utilizing Python’s version of tidytext (originally an R package from Julia Silge and David Robinson).
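To make the tokenize, remove stop words, and count idea concrete without installing anything, here is a standard-library sketch. It is only a stand-in for what nltk does: the regex tokenizer is much cruder than nltk.word_tokenize, the tiny STOP_WORDS set is hand-picked for this example, and clean_tokens is a hypothetical helper name. The common_words name mirrors the function in my script, but the body here is just one plausible way to write it.

```python
import re
from collections import Counter

# Tiny hand-picked stop list for illustration; nltk's English list is much longer.
STOP_WORDS = {"the", "of", "and", "to", "in", "a", "be", "shall", "we"}


def clean_tokens(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep runs of letters, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stop_words]


def common_words(tokens, num):
    """Return the num most frequent tokens as (word, count) pairs."""
    return Counter(tokens).most_common(num)


sample = ("Congress shall make no law; Congress shall meet; "
          "the President shall inform Congress of the State of the Union; "
          "the President shall preside.")
print(common_words(clean_tokens(sample), 2))
```

Dropping punctuation falls out of the regex here, since only letters and apostrophes are kept; with nltk you filter punctuation tokens in a separate step.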
After deep analysis, I concluded that George Washington looks better on the dollar bill than he does as a word cloud :). Have fun playing around with this. I sure did!
As always, you can find this code on my GitHub.
Side note: no one really asked me about G.W. and a word cloud. I randomly pulled this out of a hat.
# coding: utf-8
import nltk  # needed for nltk.word_tokenize below
from nltk.corpus import stopwords
from string import punctuation
from collections import Counter
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
from matplotlib import rcParams
from PIL import Image
import numpy as np
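# Preparation work
# Assumes the NLTK stop word and tokenizer data were already fetched with nltk.download()
stop = set(stopwords.words('english'))

def common_words(tokens, num):
    # Body is missing in the original; assumed to return the num most frequent tokens
    return Counter(tokens).most_common(num)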
# Import Data
with open('const.txt') as f:
    text = f.read()
tmp_toks = nltk.word_tokenize(text)
data = [w.lower() for w in tmp_toks]
# Clear stop words
data = [word for word in data if word not in stop]
# Clear punctuation
data = [word for word in data if word not in punctuation]
wordcloud = WordCloud().generate(text)
usa_coloring = np.array(Image.open('george.jpg'))
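# The original call is cut off after background_color; mask= is an assumed completion
wc = WordCloud(background_color='white', mask=usa_coloring)
wc.generate(text)
# Pull the colors for each word from the image
image_colors = ImageColorGenerator(usa_coloring)
# Show the recolored word cloud, then the source image for comparison
plt.imshow(wc.recolor(color_func=image_colors), interpolation="bilinear")
plt.axis('off')
plt.figure()
plt.imshow(usa_coloring, cmap=plt.cm.gray, interpolation="bilinear")
plt.axis('off')
plt.show()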