Using Hosted Data in your Algorithm

If you aren't already familiar with Algorithmia's Hosted Data, Data Connectors, and Data API, please take a minute to read more about them.

Now that you know how to read and write to Data Collections from local code, you'll be pleased to know that you can do the same from inside an Algorithm. In fact, this is the main way you'll get your ML models and other large data in and out of Algorithms, and is the best way to transfer binary files such as images.

Reading and Writing Files

Let's start with a simple example: you've written an Algorithm that greyscales images. The user provides a color image, and you return a black-and-white version. Your local Python code might look something like this:

from PIL import Image
def apply(filename):
    image = Image.open(filename)
    result = image.convert('L')
    result.save('result.png')

However, the user won't be sending you a local filename, because your code (running on Algorithmia's servers) doesn't have direct access to the remote user's filesystem. Instead, they'll send you a Data URL ("data://user/collection/file.jpg") pointing to a file in their collections on Algorithmia.
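Because callers can pass anything as input, it may help to confirm that the value looks like a Data URL before trying to fetch it. The following is a minimal sketch and not part of the Algorithmia client: the helper name `looks_like_data_url` is hypothetical, and it only recognizes the "data://" scheme shown above (other Data Connector schemes are deliberately not handled).

```python
DATA_SCHEME = "data://"

def looks_like_data_url(value):
    """Return True if value is a string of the form data://<something>."""
    # Hypothetical helper: only checks the hosted-data scheme prefix,
    # it does not verify that the file actually exists.
    return (
        isinstance(value, str)
        and value.startswith(DATA_SCHEME)
        and len(value) > len(DATA_SCHEME)
    )
```

Inside apply(), a failed check could be turned into an error with a helpful message instead of a confusing download failure.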

To use this file, your Algorithm must fetch it from the collection using the File API:

localfile = Algorithmia.client().file('data://user/collection/file.jpg').getFile()

You can then call .name on the returned file object to get the local filename (it'll usually be stored in the Algorithm's local /tmp directory) and work with the local file:

from PIL import Image
import Algorithmia
client = Algorithmia.client()
def apply(filename):
    localfilename = client.file(filename).getFile().name
    image = Image.open(localfilename)
    result = image.convert('L')
    result.save('result.png')

Next, you'll want to send the result file into a collection that the user can retrieve it from. All Algorithms have an automatic, per-user collection referred to by "data://.algo". Temporary data should go in "data://.algo/temp/", and permanent data should go in "data://.algo/perm/" (more info in the Hosted Data documentation). We'll copy our local result file up to the temporary collection using putFile, and use a UUID to generate a unique filename (so we don't accidentally overwrite an existing file of the same name):

from PIL import Image
import Algorithmia
import uuid
client = Algorithmia.client()
def apply(filename):
    localfilename = client.file(filename).getFile().name
    image = Image.open(localfilename)
    result = image.convert('L')
    result.save('result.png')
    uniquename = str(uuid.uuid4())+'.png'
    client.file('data://.algo/temp/'+uniquename).putFile('result.png')

Finally, we should tell the user where the result file is. While we can use the shortcut "data://.algo/temp" inside the Algorithm, the user needs the full path including the author's and the Algorithm's names, "data://.algo/authorname/algorithmname/temp/". So if my username were "demo" and the Algorithm were called "greyscaler", my final code would be:

from PIL import Image
import Algorithmia
import uuid
client = Algorithmia.client()
def apply(filename):
    localfilename = client.file(filename).getFile().name
    image = Image.open(localfilename)
    result = image.convert('L')
    result.save('result.png')
    uniquename = str(uuid.uuid4())+'.png'
    client.file('data://.algo/temp/'+uniquename).putFile('result.png')
    return 'data://.algo/demo/greyscaler/temp/'+uniquename
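
The relationship between the in-Algorithm shortcut and the path handed back to the user can be sketched as a small helper. This is only an illustration: the function name `temp_file_urls` and its parameters are hypothetical, and the path layout is simply the one described above.

```python
import uuid

def temp_file_urls(author, algoname, extension=".png"):
    """Return (internal_url, public_url) for a new, uniquely named temp file.

    internal_url uses the shortcut that works inside the Algorithm;
    public_url is the full path a caller would use to retrieve the file.
    """
    # Hypothetical helper: both URLs share one UUID-based filename.
    unique_name = str(uuid.uuid4()) + extension
    internal_url = "data://.algo/temp/" + unique_name
    public_url = "data://.algo/{}/{}/temp/{}".format(author, algoname, unique_name)
    return internal_url, public_url
```

With a helper like this, apply() would upload to the first URL and return the second, so the two can never drift apart.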

Loading Machine Learning Models or Persistent Data

The apply() function is run every single time your Algorithm is called. The code outside this function -- let's call it the global scope -- gets run much less often. Without delving into too much detail, if a user makes a series of successive (not parallel) calls to your Algorithm, your global code will be run exactly once, while your apply() will be run once per call.
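
The difference can be seen in a toy module that runs locally, without the Algorithmia client. This is a sketch under assumptions: `load_expensive_resource` and the two counters are hypothetical stand-ins for real setup work, and importing the module once plays the role of the platform loading your Algorithm.

```python
load_count = 0   # how many times the global-scope setup has run
call_count = 0   # how many times apply() has run

def load_expensive_resource():
    """Hypothetical stand-in for slow setup, such as loading a model."""
    global load_count
    load_count += 1
    return {"ready": True}

# Global scope: executed once, when the module is first loaded.
resource = load_expensive_resource()

def apply(input):
    """Executed once per call; reuses the already loaded resource."""
    global call_count
    call_count += 1
    return (load_count, call_count)
```

Calling apply() repeatedly increases only the second number in the returned pair, because the setup in the global scope is not repeated.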

For this reason, it is very important that operations which only need to occur once, and which take a lot of time, be put in the global scope. One such example is loading a large, serialized machine learning model. If I have previously trained a large model and saved it into one of my collections at data://user/collection/large_model.pkl, I'll load the model in the global scope, then make the actual predictions inside the apply method:

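import Algorithmia
# On scikit-learn 0.23 and later, sklearn.externals.joblib no longer exists: use "import joblib" instead
from sklearn.externals import joblib
client = Algorithmia.client()
# global scope: download and deserialize the model once, not on every call
modelfile = client.file('data://user/collection/large_model.pkl').getFile().name
model = joblib.load(modelfile)
def apply(input):
    return model.predict(input)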

Sharing Data with your Team

If you have models or other data that you want to share with others, such as your department or team, but don't want to make Public, create or join an Organization. For each Org you are a member of, you'll see a separate collection in your Data Sources. Instead of placing data into your own personal collections (e.g. data://.my or data://user), put it into the Org's collection (data://organization) so all your group's members can access it. Any Algorithms inside that Organization will be able to access the Org-level data collections, so long as their access level is "My algorithms" or higher.

Ready for more? Next, learn about pipelining: calling one Algorithm from another.