Google Vision API in R – RoogleVision

Using the Google Vision API in R

Utilizing RoogleVision

After doing my post last month on OpenCV and face detection, I started looking into other algorithms used for pattern detection in images. As it turns out, Google has done a phenomenal job with their Vision API. It’s absolutely incredible the amount of information it can spit back to you by simply sending it a picture.

Also, it’s 100% free! I believe that includes 1000 images per month. Amazing!

In this post I’m going to walk you through the absolute basics of accessing the power of the Google Vision API using the RoogleVision package in R.

As always, we’ll start off loading some libraries. I wrote some extra notation around where you can install them within the code.

# Normal Libraries

# devtools::install_github("flovv/RoogleVision")
library(jsonlite) # to import credentials

# For image processing
# source("")
# biocLite("EBImage")

# For Latitude Longitude Map

Google Authentication

In order to use the API, you have to authenticate. There is plenty of documentation out there about how to setup an account, create a project, download credentials, etc. Head over to Google Cloud Console if you don’t have an account already.

# Credentials file I downloaded from the cloud console
creds = fromJSON('credentials.json')

# Google Authentication - Use Your Credentials
# options("googleAuthR.client_id" = "")
# options("googleAuthR.client_secret" = "")

options("googleAuthR.client_id" = creds$installed$client_id)
options("googleAuthR.client_secret" = creds$installed$client_secret)
options("googleAuthR.scopes.selected" = c(""))

Now You’re Ready to Go

The function getGoogleVisionResponse takes three arguments:

  1. imagePath
  2. feature
  3. numResults

Numbers 1 and 3 are self-explanatory, “feature” has 5 options:


These are self-explanatory but it’s nice to see each one in action.

As a side note: there are also other features that the API has which aren’t included (yet) in the RoogleVision package such as “Safe Search” which identifies inappropriate content, “Properties” which identifies dominant colors and aspect ratios and a few others can be found at the Cloud Vision website

Label Detection

This is used to help determine content within the photo. It can basically add a level of metadata around the image.

Here is a photo of our dog when we hiked up to Audubon Peak in Colorado:

plot of chunk unnamed-chunk-2

dog_mountain_label = getGoogleVisionResponse('dog_mountain.jpg',
                                              feature = 'LABEL_DETECTION')
##            mid           description     score
## 1     /m/09d_r              mountain 0.9188690
## 2 /g/11jxkqbpp mountainous landforms 0.9009549
## 3    /m/023bbt            wilderness 0.8733696
## 4     /m/0kpmf             dog breed 0.8398435
## 5    /m/0d4djn            dog hiking 0.8352048

All 5 responses were incredibly accurate! The “score” that is returned is how confident the Google Vision algorithms are, so there’s a 91.9% chance a mountain is prominent in this photo. I like “dog hiking” the best – considering that’s what we were doing at the time. Kind of a little bit too accurate…

Landmark Detection

This is a feature designed to specifically pick out a recognizable landmark! It provides the position in the image along with the geolocation of the landmark (in longitude and latitude).

My wife and I took this selfie in at the Linderhof Castle in Bavaria, Germany.

us_castle <- readImage('us_castle_2.jpg')

plot of chunk unnamed-chunk-4

The response from the Google Vision API was spot on. It returned “Linderhof Palace” as the description. It also provided a score (I reduced the resolution of the image which hurt the score), a boundingPoly field and locations.

  • Bounding Poly – gives x,y coordinates for a polygon around the landmark in the image
  • Locations – provides longitude,latitude coordinates
us_landmark = getGoogleVisionResponse('us_castle_2.jpg',
                                      feature = 'LANDMARK_DETECTION')
##         mid      description     score
## 1 /m/066h19 Linderhof Palace 0.4665011
##                               vertices          locations
## 1 25, 382, 382, 25, 178, 178, 659, 659 47.57127, 10.96072

I plotted the polygon over the image using the coordinates returned. It does a great job (certainly not perfect) of getting the castle identified. It’s a bit tough to say what the actual “landmark” would be in this case due to the fact the fountains, stairs and grounds are certainly important and are a key part of the castle.

us_castle <- readImage('us_castle_2.jpg')
xs = us_landmark$boundingPoly$vertices[[1]][1][[1]]
ys = us_landmark$boundingPoly$vertices[[1]][2][[1]]

plot of chunk unnamed-chunk-6

Turning to the locations – I plotted this using the leaflet library. If you haven’t used leaflet, start doing so immediately. I’m a huge fan of it due to speed and simplicity. There are a lot of customization options available as well that you can check out.

The location = spot on! While it isn’t a shock to me that Google could provide the location of “Linderhof Castle” – it is amazing to me that I don’t have to write a web crawler search function to find it myself! That’s just one of many little luxuries they have built into this API.

latt = us_landmark$locations[[1]][[1]][[1]]
lon = us_landmark$locations[[1]][[1]][[2]]
m = leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  setView(lng = lon, lat = latt, zoom = 5) %>%
  addMarkers(lng = lon, lat = latt)

Face Detection

My last blog post showed the OpenCV package utilizing the haar cascade algorithm in action. I didn’t dig into Google’s algorithms to figure out what is under the hood, but it provides similar results. However, rather than layering in each subsequent “find the eyes” and “find the mouth” and …etc… it returns more than you ever needed to know.

  • Bounding Poly = highest level polygon
  • FD Bounding Poly = polygon surrounding each face
  • Landmarks = (funny name) includes each feature of the face (left eye, right eye, etc.)
  • Roll Angle, Pan Angle, Tilt Angle = all of the different angles you’d need per face
  • Confidence (detection and landmarking) = how certain the algorithm is that it’s accurate
  • Joy, sorrow, anger, surprise, under exposed, blurred, headwear likelihoods = how likely it is that each face contains that emotion or characteristic

The likelihoods is another amazing piece of information returned! I have run about 20 images through this API and every single one has been accurate – very impressive!

I wanted to showcase the face detection and headwear first. Here’s a picture of my wife and I at “The Bean” in Chicago (side note: it’s awesome! I thought it was going to be really silly, but you can really have a lot of fun with all of the angles and reflections):

us_hats_pic <- readImage('us_hats.jpg')

plot of chunk unnamed-chunk-8

us_hats = getGoogleVisionResponse('us_hats.jpg',
                                      feature = 'FACE_DETECTION')
##                                 vertices
## 1 295, 410, 410, 295, 164, 164, 297, 297
## 2 353, 455, 455, 353, 261, 261, 381, 381
##                                 vertices
## 1 327, 402, 402, 327, 206, 206, 280, 280
## 2 368, 439, 439, 368, 298, 298, 370, 370
## landmarks...                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           landmarks
##   rollAngle panAngle tiltAngle detectionConfidence landmarkingConfidence
## 1  7.103324 23.46835 -2.816312           0.9877176             0.7072066
## 2  2.510939 -1.17956 -7.393063           0.9997375             0.7268016
##   joyLikelihood sorrowLikelihood angerLikelihood surpriseLikelihood
##   underExposedLikelihood blurredLikelihood headwearLikelihood
us_hats_pic <- readImage('us_hats.jpg')

xs1 = us_hats$fdBoundingPoly$vertices[[1]][1][[1]]
ys1 = us_hats$fdBoundingPoly$vertices[[1]][2][[1]]

xs2 = us_hats$fdBoundingPoly$vertices[[2]][1][[1]]
ys2 = us_hats$fdBoundingPoly$vertices[[2]][2][[1]]


plot of chunk unnamed-chunk-10

Here’s a shot that should be familiar (copied directly from my last blog) – and I wanted to highlight the different features that can be detected. Look at how many points are perfectly placed:

my_face_pic <- readImage('my_face.jpg')

plot of chunk unnamed-chunk-11

my_face = getGoogleVisionResponse('my_face.jpg',
                                      feature = 'FACE_DETECTION')
##                               vertices
## 1 456, 877, 877, 456, NA, NA, 473, 473
##                               vertices
## 1 515, 813, 813, 515, 98, 98, 395, 395
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             landmarks
## landmarks ...
##    rollAngle  panAngle tiltAngle detectionConfidence landmarkingConfidence
## 1 -0.6375801 -2.120439  5.706552            0.996818             0.8222974
##   joyLikelihood sorrowLikelihood angerLikelihood surpriseLikelihood
##   underExposedLikelihood blurredLikelihood headwearLikelihood
## [[1]]
##                            type position.x position.y    position.z
## 1                      LEFT_EYE   598.7636   192.1949  -0.001859295
## 2                     RIGHT_EYE   723.1612   192.4955  -4.805475700
## 3          LEFT_OF_LEFT_EYEBROW   556.1954   165.2836  15.825399000
## 4         RIGHT_OF_LEFT_EYEBROW   628.8224   159.9029 -23.345352000
## 5         LEFT_OF_RIGHT_EYEBROW   693.0257   160.6680 -25.614508000
## 6        RIGHT_OF_RIGHT_EYEBROW   767.7514   164.2806   7.637372000
## 7         MIDPOINT_BETWEEN_EYES   661.2344   185.0575 -29.068363000
## 8                      NOSE_TIP   661.9072   260.9006 -74.153710000
my_face_pic <- readImage('my_face.jpg')

xs1 = my_face$fdBoundingPoly$vertices[[1]][1][[1]]
ys1 = my_face$fdBoundingPoly$vertices[[1]][2][[1]]

xs2 = my_face$landmarks[[1]][[2]][[1]]
ys2 = my_face$landmarks[[1]][[2]][[2]]

points(x=xs2,y=ys2,lwd=2, col='lightblue')

plot of chunk unnamed-chunk-14

Logo Detection

To continue along the Chicago trip, we drove by Wrigley field and I took a really bad photo of the sign from a moving car as it was under construction. It’s nice because it has a lot of different lines and writing the Toyota logo isn’t incredibly prominent or necessarily fit to brand colors.

This call returns:

  • Description = Brand name of the logo detected
  • Score = Confidence of prediction accuracy
  • Bounding Poly = (Again) coordinates of the logo
wrigley_image <- readImage('wrigley_text.jpg')

plot of chunk unnamed-chunk-15

wrigley_logo = getGoogleVisionResponse('wrigley_text.jpg',
                                   feature = 'LOGO_DETECTION')
##           mid description     score                               vertices
## 1 /g/1tk6469q      Toyota 0.3126611 435, 551, 551, 435, 449, 449, 476, 476
wrigley_image <- readImage('wrigley_text.jpg')
xs = wrigley_logo$boundingPoly$vertices[[1]][[1]]
ys = wrigley_logo$boundingPoly$vertices[[1]][[2]]

plot of chunk unnamed-chunk-17

Text Detection

I’ll continue using the Wrigley Field picture. There is text all over the place and it’s fun to see what is captured and what isn’t. It appears as if the curved text at the top “field” isn’t easily interpreted as text. However, the rest is caught and the words are captured.

The response sent back is a bit more difficult to interpret than the rest of the API calls – it breaks things apart by word but also returns everything as one line. Here’s what comes back:

  • Locale = language, returned as source
  • Description = the text (the first line is everything, and then the rest are indiviudal words)
  • Bounding Poly = I’m sure you can guess by now
wrigley_text = getGoogleVisionResponse('wrigley_text.jpg',
                                   feature = 'TEXT_DETECTION')
##   locale
## 1     en

##                                                                                                        description
##                                 vertices
## 1   55, 657, 657, 55, 210, 210, 852, 852
## 2 343, 482, 484, 345, 217, 211, 260, 266

wrigley_image <- readImage('wrigley_text.jpg')

for(i in 1:length(wrigley_text$boundingPoly$vertices)){
  xs = wrigley_text$boundingPoly$vertices[[i]]$x
  ys = wrigley_text$boundingPoly$vertices[[i]]$y

plot of chunk unnamed-chunk-19

That’s about it for the basics of using the Google Vision API with the RoogleVision library. I highly recommend tinkering around with it a bit, especially because it won’t cost you a dime.

While I do enjoy the math under the hood and the thinking required to understand alrgorithms, I do think these sorts of API’s will become the way of the future for data science. Outside of specific use cases or special industries, it seems hard to imagine wanting to try and create algorithms that would be better than ones created for mass consumption. As long as they’re fast, free and accurate, I’m all about making my life easier! From the hiring perspective, I much prefer someone who can get the job done over someone who can slightly improve performance (as always, there are many cases where this doesn’t apply).

Please comment if you are utilizing any of the Google API’s for business purposes, I would love to hear it!

As always you can find this on my GitHub