Upload your Scikit-Learn Model

In this lesson, you'll learn how to write a simple algorithm using a pre-trained Scikit-learn model we've provided. Before you begin, you may wish to review Creating an Algorithm and Using Hosted Data in Algorithms.

This algorithm shows how to deploy a random forest regression model, trained on Boston housing-price data, to predict the prices of Boston houses the model hasn't seen before.

Note that for any model you deploy on your Algorithmia account, you'll need to train and save the serialized model file. For Scikit-learn models, you can use Python's pickle module to serialize your model and later deserialize it.
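As a sketch of what that training-side serialization can look like, here's a toy example of pickling a scikit-learn random forest regressor. The feature rows, targets, and filename are illustrative stand-ins, not the actual Boston training script:

```python
import pickle
from sklearn.ensemble import RandomForestRegressor

# Toy feature rows and target prices standing in for the Boston housing data
X = [[0.1, 20.0], [0.3, 15.5], [0.5, 11.0], [0.7, 30.5]]
y = [10.0, 15.0, 20.0, 25.0]

model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X, y)

# Serialize the trained model; this .pkl file is what you upload to a data collection
with open("scikit-demo-boston-regression.pkl", "wb") as f:
    pickle.dump(model, f)
```

Anything that can be unpickled with the same library versions stated in your dependency file will work the same way.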

Before you get started, you'll need the files located in the GitHub repository associated with this lesson. Alternatively, you can download the files directly from this lesson page under "Summary" in the right-hand column (note that it won't show if you're viewing this page in full-screen mode). Once you've downloaded the files, click the full-screen mode icon at the bottom of this page, to the right of the "Next" button, to better view this lesson.

Now that you have the lesson files on your computer, you'll need to upload the CSV file and the pre-trained, pickled Scikit-learn model to a data collection; leave the demo.py file alone for now.

Remember how you learned to create data collections in the Algorithmia Hosted Data course? If you haven't taken that course yet, check it out now. Here, we'll create a data collection; you'll notice we've named ours "demo_files", but you can name yours as you like.

Once you've created a data collection, you can either drag and drop the files scikit-demo-boston-regression.pkl and boston_test_data.csv, or click "Drop files here to upload" and select them from where you stored them on your computer:

upload data

Take note of the path created that starts with "data://" and shows your username and the data collection name along with your file name:

You'll want to use this path in your algorithm to point to your own data and model files, so we recommend keeping this data collection page open and opening a new tab where you can create your algorithm. That way you can easily copy and paste the paths from your data collection when you're ready to add them to the demo code sample.

uploaded models

Now go ahead and click the "Plus" icon in the navigation, and create your Scikit-learn algorithm, naming it as you like. You've already learned how to create an algorithm in a previous course, so we won't go through the steps here, but note that you'll want to choose "Python 3.x" for your language. The rest of the permissions and execution environment settings can stay at their defaults.

Remember from the Editing Your Algorithm course that once you create your algorithm, you can edit it either through the CLI tools or the Web IDE. It's your choice how you interact with your algorithm, but this course will demonstrate working in the Web IDE:

new scikit-learn algorithm

Before we touch our code, we're going to add the required dependencies. Click "Dependencies", found right above your source code. This opens a modal that is essentially a requirements.txt file, pulling the stated libraries from PyPI. If you state a package name without a version number, you'll get the latest version that we support; otherwise, state a version number or range.

In this example we have an older Scikit-learn model, so we need to pin a version range for our model to work. Go ahead and add these to the libraries already in your dependency file:

numpy

scikit-learn>=0.14,<0.18

So your whole dependency file will look like this:

dependency file
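If the screenshot is hard to read, the resulting file will look something like the following. This is an assumption based on the default Python dependency file, which typically lists the Algorithmia client; your default lines may differ slightly:

```
algorithmia>=1.0.0,<2.0
six
numpy
scikit-learn>=0.14,<0.18
```

The important part is the two lines you added at the bottom; leave whatever defaults your file already contains in place.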

Now remove the boilerplate code in your newly created algorithm, and copy and paste in the code from the "demo.py" file, found under the "Summary" section of this lesson, if you didn't download it along with the model and CSV file. If you need to exit full-screen mode to view the file, click the icon to the right of the "Next" button at the bottom right of your screen.

Here is the full code from demo.py:

full algorithm code in ide

Notice in the first few lines of our script, we import the Python packages required by our algorithm. Then on line 7 we create the variable "client" in global scope to use throughout our algorithm. This enables us to access our data in data collections via the Data API.

On line 12 inside the "load_model()" function, you'll want to replace that string with your path from your data collections for the pickled model file.

Then, notice on line 13 we pass that data collection path to the Data API using:

client.file(file_path).getFile().name

And then we use the Pickle library to open the file.

Notice that line 20 is where we call the function "load_model()". This is important: you'll always want to load the model outside of the "apply()" function, so that the model file is loaded into memory only on the initial call within a session. While the first call to your algorithm might take a bit of time depending on the size of your model, subsequent calls will be much faster. If you were to load your model inside the "apply()" function, the model would be reloaded on every call of your algorithm. Also, if you're tempted to package your model file as a Python module and import that module into your algorithm file, don't: that results in a loss of performance and we don't recommend it.
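The load-once pattern can be sketched in a self-contained way like this. A toy model is pickled first so the example runs anywhere; on Algorithmia you'd fetch the file through the Data API instead, as the comment indicates, and the path and names here are illustrative:

```python
import pickle
from sklearn.ensemble import RandomForestRegressor

# Stand-in for the uploaded .pkl: train and serialize a toy model locally
_toy = RandomForestRegressor(n_estimators=5, random_state=0)
_toy.fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])
with open("scikit-demo-boston-regression.pkl", "wb") as f:
    pickle.dump(_toy, f)

def load_model():
    # On Algorithmia this local path would come from the Data API, e.g.:
    #   file_path = client.file("data://your_username/demo_files/scikit-demo-boston-regression.pkl").getFile().name
    file_path = "scikit-demo-boston-regression.pkl"
    with open(file_path, "rb") as f:
        return pickle.load(f)

# Called once at module load, outside apply(), so only the first
# call in a session pays the deserialization cost
model = load_model()

def apply(input):
    # Subsequent calls reuse the already-loaded model
    return model.predict(input).tolist()
```

Because the module body runs once per session, every call to apply() after the first reuses the in-memory model.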

The next function, "process_input()", simply turns the CSV file into a numpy array; we call it within the "apply()" function, where we pass in the user-provided "input".
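A minimal sketch of that kind of conversion, assuming the CSV holds one row of numeric feature values per line with no header (the real demo.py may parse it differently):

```python
import csv
import numpy as np

def process_input(local_csv_path):
    # Read every row of the CSV and coerce the values to floats
    with open(local_csv_path) as f:
        rows = [[float(value) for value in row] for row in csv.reader(f)]
    return np.array(rows)

# Quick check with a throwaway file standing in for boston_test_data.csv
with open("boston_sample.csv", "w") as f:
    f.write("0.1,20.0\n0.3,15.5\n")

print(process_input("boston_sample.csv").shape)  # → (2, 2)
```

The resulting array can be handed straight to the model's predict() method.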

The "input" argument is the data or any other input from the user that gets passed into your algorithm. It's important to use exception handling so you can support inputs from multiple data sources: not just data collections, as shown here, but also data files hosted in S3, Azure Blob Storage, or other sources we have data connectors for. For a great example of handling multiple file types, or if you want to see a PyTorch algorithm in action, check out the Open Anomaly Detection algorithm on the Algorithmia Marketplace. You don't need an account to view the algorithm or its docs!
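One hedged way to sketch that kind of defensive input handling; the branches, URI prefixes, and error messages are illustrative, and the remote-file branch is left as a comment since it needs the platform client:

```python
import numpy as np

def parse_input(input):
    # Accept a data URI (to be fetched via the Data API) or raw feature rows
    if isinstance(input, str):
        if input.startswith(("data://", "s3://", "azureblob://")):
            # On Algorithmia: return client.file(input).getFile().name
            raise NotImplementedError("fetching remote files requires the platform client")
        raise ValueError("expected a data URI or a list of feature rows")
    try:
        return np.array(input, dtype=float)
    except (TypeError, ValueError):
        raise ValueError("feature rows must be numeric") from None
```

Raising a clear error for unsupported input makes failed calls much easier for your algorithm's users to debug.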

Notice that we are returning the predicted data as output from our Scikit-learn model in the apply() function.

Now we'll click the "Build" button at the top right of the Web IDE. This commits our code to a Git repository: every algorithm is backed by one, and as you develop your algorithm, each "Build" commits your code and returns a hash version of your algorithm, which you'll see in the Algorithmia console:

You can use that hash version to call your algorithm locally using one of the language clients for testing purposes while you work on perfecting your algorithm.
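For example, with the Algorithmia Python client you can pipe input to that hash version. The username, algorithm name, hash, and API key below are placeholders, and the import happens inside the function only so this sketch loads without the platform SDK installed:

```python
def call_unpublished_version(csv_data_uri):
    # Requires `pip install algorithmia` and a valid API key
    import Algorithmia
    client = Algorithmia.client("YOUR_API_KEY")
    # Address the algorithm by the hash version returned from "Build"
    algo = client.algo("your_username/scikitLearnDemo/4bd62cd51d6c2e")
    return algo.pipe(csv_data_uri).result

# Example call (uses your own data collection path):
# call_unpublished_version("data://your_username/demo_files/boston_test_data.csv")
```

This is handy for scripting test calls against work-in-progress versions before you publish.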

Note that you'll get a semantic version number once you publish your algorithm.

test algorithm in console

Now we're ready to test our algorithm. Go back to the data collection where we got our model path, copy the path to the CSV file, and paste it into the Algorithmia console (wrapping it in quotes so it's a properly JSON-formatted string), then hit Return/Enter on your keyboard:

testing algorithm output

If you're happy with the results, you can now publish your model, using that path to the CSV as your sample input. Great work!

Summary

Host your data and your model on Algorithmia's Hosted Data Collections for faster load times on your models.