Simple Recommendation System with Python, SciKit Learn, and DockerSlim

Here at Slim.AI, we enjoy a wee dram of scotch when the occasion arises (like when we launch our new Container Optimization feature in closed beta!).

Personally, during the pandemic, I’ve become obsessed with learning more about machine-driven recommendation systems, and also with trying to broaden my single malt horizons, in moderation of course. This problem seems perfectly suited to a cloud-native application, and is a great way to show off a bit of what the tech behind Slim.AI can do to make cloud-native development easier.

With PyCon in full swing, it felt like the right time to dive into some Python-based tools for creating recommendations (SciKit Learn) and show how we can create a simple front-end with Flask to give users a way to get their own recs. Finally, we can use the Slim tools to make the application production-ready and to analyze both the original container and the slimmed one.

The Application
First, you’ll need an application that solves the problem at hand. In our case, we want a simple web app that can take a given product (in this case, a brand of scotch) and suggest something similar.

Thankfully, scotch recommendations are a known quantity in data science circles, thanks in large part to the example set out by University of Montreal data scientists Francois-Joseph Lapointe and Pierre Legendre in the mid-90s. Their work is fascinating for aspiring ML Engineers, but for our purposes, we’ll just use a cleaned-up version of their data to create a quick recommendation system for our Scotch Recommender app, which we’ll call “Ron”, after that great scotch enthusiast, Ron Burgundy.


We’ll use two standard Python data science libraries, pandas and scikit-learn (sklearn) to build our recommender.

# main.py
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

And we’ll read in the CSV file so we don’t have to deal with databases and Docker networks. The data is basically a bunch of sparse arrays detailing various facets of a scotch, from Color to Nose to Palate to Region.

# import and clean data
df = pd.read_csv('scotch-ratings.csv')
df.index = df['NAME']

For a recommender, we’ll use a basic Cosine Similarity matrix to determine scotches that are similar to each other. (For a deeper discussion on recommendation systems, join our Discord and tag me [@psv] in a comment some time.)

# create quick cosine similarity model
distances = cosine_similarity(df[df.columns[1:]])
simdf = pd.DataFrame(data=distances,columns=df['NAME'],index=df.index)

# look up scotch by name brand and return 5 most similar
def get_sim_scotch(name):
    return simdf[name].sort_values(ascending=False)[1:6]

Perfect. I now have a function get_sim_scotch that I can pass a brand of scotch to and get back five recommendations along with their similarity scores.
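
If you’re curious what cosine_similarity is computing under the hood, here is a minimal pure-Python sketch. The three whiskies and their feature vectors are made up for illustration (they are not rows from the dataset), and the helper names cosine and most_similar are mine, not scikit-learn’s.

```python
import math

# Hypothetical binary feature vectors (think: smoky, sweet, fruity, peaty).
# These are illustrative only, not rows from the scotch CSV.
features = {
    "Scotch A": [1, 0, 1, 1],
    "Scotch B": [1, 0, 1, 0],
    "Scotch C": [0, 1, 0, 0],
}

def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(name, n=5):
    """Rank every other whisky by similarity to `name`, highest first."""
    scores = {
        other: cosine(features[name], vec)
        for other, vec in features.items()
        if other != name  # skip the whisky itself (it always scores 1.0)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

print(most_similar("Scotch A"))
```

Scotch B shares two of Scotch A’s three traits, so it ranks first, while Scotch C shares none and scores zero. The [1:6] slice in get_sim_scotch plays the same role as the other != name filter here: the top-ranked match is normally the whisky itself, so it gets skipped.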

But how am I, as a user, going to do so? And didn’t we say we wanted this to be cloud native?

First, we’ll need a way to put this app up on the web and let users interact with it. We’ll use Flask, a lightweight Python web service framework that makes serving basic websites easy. To do so, we’ll have to make our app a little more complex by adding folders for HTML templates and static CSS and JS files.

/app
-> ron.py
-> templates
-> static

I’ll create a single HTML file (‘index.html’) in the templates folder, and add the following to my ron.py file to enable Flask and serve the page.

from flask import Flask, render_template, request
...
app = Flask(__name__)
...


@app.route('/')
def index():
    return render_template('index.html',whiskeys=df['NAME'].values)

In my local environment, I can run ron.py (make sure you have Flask, pandas, and scikit-learn installed) and the page should serve up. I can add a little dropdown menu to the page that pushes users to pages tailored with recommendations.

<div>
<select name="list" onChange="window.location.href=this.value">
<option selected>Choose a Whiskey</option>
{% for w in whiskeys %}
  <option value="/sims/{{w}}">{{w}}</option>
{% endfor %}
</select>
</div>

And finally, I’ll create another route in my web application that feeds back the recommendations from my Cosine Similarity model.

# this is flask
@app.route('/sims/<brand>')
def sims(brand):
    if brand in df.index:  # the index holds the NAME values
        results = get_sim_scotch(brand).to_dict()
    else:
        results = {"":""}
    
    return render_template('index.html',whiskeys=df['NAME'].values, recs=results)

Being a little lazy, I’ll add this to ‘index.html’, which means the app can run on a single page whether it’s the first use or there are recommendations available.

<div>
<ol>{% for k,v in recs.items() %}<li><strong>{{k}}</strong> ({{v}})</li>{% endfor %}</ol>
</div>
</body>

For more on what we’ve done or for more info on Flask and Jinja, its related templating engine, please check out the documentation for those projects.

My app is now working locally, and, as a Dalwhinnie fan, I can select that from the dropdown, see five great recommendations, and it’s off I go to do QA…

My results for Dalwhinnie read:

  1. Miltonduff (0.9984176331926036)
  2. Inchgower (0.9982705313161198)
  3. Glendronac (0.9982434154148843)
  4. Cardhu (0.9982266269508561)
  5. Aultmore (0.998214961869034)

That Escalated Quickly

But wait, that’s not what I set out to do. The scotch will have to wait. I want to put this application in the cloud for others to enjoy, which means I need to deploy it. Enter Docker.

Docker hit the scene in 2013 and has revolutionized the way we move applications into the cloud and scale up when we get an unexpected bump in traffic.

To leverage Docker, I first need to make some infrastructure decisions. I want a Python-friendly web server, and can visit Docker Hub to find a base image that can work for me.

But here things get a bit complicated. A quick search for “Flask” turns up a lot of community supported images, but nothing with the Official tag and a lot of them purpose-built for specific demos.

I could start with a base Ubuntu image or Python 3.7 container, but that could lead to a lot of manual steps to configure all the webserver dependencies, open ports, iron out version issues, and get it running. Most Docker-Flask tutorials online just show how to run a container locally, and that doesn’t do me too much good either, since I’ll need a web server in production.

Well, I know NGINX is one of the most popular web servers on the planet, as well as one of the most popular Docker image downloads with more than a billion downloads and counting. But does it play nice with Flask? A bit of research turns up a Community image tiangolo/uwsgi-nginx-flask:flask that seems built for hosting Flask applications.

I’ll collect my dependencies into my requirements.txt file using pip.

$ pip freeze > requirements.txt

And now I’m ready to create a small Dockerfile that pulls the base image to create a container for our app. The second line copies the app into the Docker container’s /app directory.

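FROM tiangolo/uwsgi-nginx-flask:flask
COPY ./app /app
# requirements.txt lives next to the Dockerfile, so copy it in before installing
COPY requirements.txt /app/requirements.txt
RUN pip install -r /app/requirements.txt
# exec form needs the command and its argument as separate list items
ENTRYPOINT ["python", "main.py"]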

If I try to docker build from the Dockerfile, however, I get an error. My app isn’t formatted in the specific way the image expects. We need to change the folder structure so the code sits in an app folder, and rename the base Flask application to main.py. The documentation also states that the Flask object in my application should be called just app, which we’ve already done, but which could have led to a lot of refactoring.

The new project looks like:

Dockerfile
requirements.txt
-> /app
--> main.py
--> scotch-final.csv
--> /static
--> /templates
--->index.html

I can now build my image using Docker.

$ docker build -t psvann/scotchapp:latest .

The image builds successfully and I can now run the image to create a container instance of my new app. We’ll give the container a cheeky name (how about ron?) so we can more easily find and debug it later on.

$ docker run --name ron -dp 80:80 psvann/scotchapp:latest

In A Glass Case of Emotion: Debugging

With my image built and my container running, I think I’m off to the races. But wait. Where is this application even running? When debugging in my test environment, I simply hit 127.0.0.1:80 to access the Flask app. Doing so now turns up nothing.

Running docker logs ron shows a lot of scary warning messages, and after diving down the StackOverflow rabbit hole, I discover a couple of issues.

First, I need to update Flask to run on 0.0.0.0, not the default 127.0.0.1.
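
The difference between the two addresses is easy to see with plain sockets. This is a sketch using only the standard library, not Flask itself, and bound_address is a helper name I made up. A socket bound to 127.0.0.1 only accepts connections from inside the same machine (or, here, the same container), while 0.0.0.0 listens on every interface, including the one Docker forwards published ports to. In Flask, the equivalent change is passing host='0.0.0.0' to app.run().

```python
import socket

def bound_address(host):
    """Bind a throwaway TCP socket to `host` and report where it is listening."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 asks the OS for any free port
        sock.listen(1)
        return sock.getsockname()[0]

# Loopback only: reachable from inside the container, but not via `docker run -p`.
print(bound_address("127.0.0.1"))
# All interfaces: reachable through the port Docker publishes to the host.
print(bound_address("0.0.0.0"))
```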

Enter the Slim Way

Now, it turns out that the user behind that image, tiangolo, is the incomparable Sebastián Ramírez, a Docker Captain and the creator of FastAPI and many other useful open source projects. This particular container is well-documented and well-supported, with more than one million downloads.

However, if I were to be building critical business infrastructure or had an InfoSec audit to clear, I’d be hard pressed to use a community image like this without knowing more about it. I also don’t know what else is in the image that I’m not using, meaning I’m probably carrying a lot of extra digital baggage through my deployment pipelines.

This is where Slim comes in. With Docker Slim installed, I can run the Docker-Slim X-Ray command to examine my new container.

$ docker-slim xray --target psvann/scotchapp:latest

The output tells me how layers are constructed in my image, along with other useful info, such as which ports are exposed (only 80 and 443 thankfully) and the total size (1.3 GB… ew…).

As you’d expect, Docker-Slim can also build, minify, and optimize my image for me.

$ docker-slim build --target psvann/scotchapp:latest

First, Docker-Slim builds the “fat” image and scans its ports to ensure everything is working correctly.

docker-slim[build]: info=container name=dockerslimk_21183_20210307122738 id=4bd8628473f3c1680e1fc6404a3cb916710ad0e633cad76934077dd8f17083c8 target.port.list=[55002,55003] target.port.info=[443/tcp => 0.0.0.0:55002,80/tcp => 0.0.0.0:55003] message='YOU CAN USE THESE PORTS TO INTERACT WITH THE CONTAINER'
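
That log line is dense, so here is a small sketch that pulls the port mappings out of it. The regular expression is my own guess at the format based on the single line above, not an official docker-slim parser, and parse_port_info is a hypothetical helper name.

```python
import re

# The target.port.info portion of the log line shown above.
log_line = (
    "target.port.info=[443/tcp => 0.0.0.0:55002,80/tcp => 0.0.0.0:55003] "
    "message='YOU CAN USE THESE PORTS TO INTERACT WITH THE CONTAINER'"
)

def parse_port_info(line):
    """Map each container port to the host port it was published on."""
    pairs = re.findall(r"(\d+)/tcp => [\d.]+:(\d+)", line)
    return {int(container): int(host) for container, host in pairs}

ports = parse_port_info(log_line)
print(ports)  # {443: 55002, 80: 55003}
print(f"Try the app at http://localhost:{ports[80]}")
```

In this run, container port 80 landed on host port 55003, so that is where the app could be reached while docker-slim was probing it.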

Next, it outputs some standard-format security reports as JSON artifacts that I can review or share with my security team.

docker-slim[build]: info=results artifacts.location='/tmp/docker-slim-state/.docker-slim-state/images/20ba32caea86a189d9b615c8ee3d4de1da4f92a0f3fa8f9aafa79790e1d0d38e/artifacts'

docker-slim[build]: info=results artifacts.report=creport.json

docker-slim[build]: info=results artifacts.dockerfile.original=Dockerfile.fat

docker-slim[build]: info=results artifacts.dockerfile.new=Dockerfile

docker-slim[build]: info=results artifacts.seccomp=psvann-scotchapp-seccomp.json

docker-slim[build]: info=results artifacts.apparmor=psvann-scotchapp-apparmor-profile

Finally, and perhaps most importantly, Docker Slim will compress the container, removing unnecessary items and shrinking it to speed up rebuilds or scan time as it moves through my CI/CD pipeline.

docker-slim[build]: info=results status='MINIFIED BY 5.23X [1262702155 (1.3 GB) => 241397672 (241 MB)]'
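
The 5.23X figure is just the ratio of the two byte counts in that line, which is easy to sanity-check. The variable names below are mine.

```python
fat_bytes = 1262702155   # original image size from the log line
slim_bytes = 241397672   # minified image size from the log line

ratio = fat_bytes / slim_bytes
saved = 1 - slim_bytes / fat_bytes

print(f"minified by {ratio:.2f}x")          # minified by 5.23x
print(f"{saved:.0%} of the image removed")  # 81% of the image removed
```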

Sixty Percent of the time It Works Every Time

Conclusion to come…


This is an excellent post, and as a professional data scientist I routinely deploy production applications and models with containers. I’m very interested in extending this further (especially with my favorite language of R), and here are some cool data sets we could try in a future project:
