How To Deploy A Model

Özge Karalı
4 min read · Oct 16, 2023


with Google Cloud AI Platform

We will look at how to deploy models, first to TF Serving and then to Google Cloud AI Platform.

If you want to put a model into production, you may simply need to run it on a batch of data, perhaps with a script scheduled to run at specific times. You may also decide to wrap the model in a service that works on live data, querying it through a REST API. Over time you must retrain the model regularly on fresh data and push the updated version to production, transitioning smoothly from the old model to the new one. You may also run multiple models in parallel to perform A/B experiments. If your product becomes successful, your service may start to receive plenty of queries per second; the load on the service grows, so you must scale it up as well. And since you will retrain the model many times, you may use GPUs or TPUs to speed up training.

TensorFlow Serving
TF Serving makes it easy to switch model versions, scale the service, run A/B experiments, and ensure that all your software components rely on the same model versions. It simplifies testing and development, can sustain a high load, serves multiple versions of your models, and watches a model repository to automatically deploy the latest versions.

TensorFlow provides the tf.saved_model.save() function to export models. You just give it the model and an export path that encodes the model's name and version number, and the function saves the model's computation graph and its weights. Include all the preprocessing layers in the model you export, so that it can ingest data in its natural form once it is deployed to production. This avoids having to handle preprocessing separately in every application that uses the model. You can later load a SavedModel back using the tf.saved_model.load() function. Suppose you have a NumPy array (x_new) containing three images that you want to make predictions for; you can also save it in NumPy's npy format.
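
Here is a minimal sketch of this export/load step. The model, the input array, the model name, and the version number below are placeholders used only to make the example runnable; substitute your own trained model and data.

```
import numpy as np
import tensorflow as tf

# A stand-in Keras model and input batch, just to make the sketch runnable;
# replace them with your own trained model and real images.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(10, activation="softmax"),
])
x_new = np.random.rand(3, 28, 28).astype(np.float32)  # three dummy "images"

# Export the model: the path encodes a (placeholder) model name and version number.
model_path = "my_model/0001"
tf.saved_model.save(model, model_path)  # saves the computation graph and the weights

# The SavedModel can later be loaded back:
loaded_model = tf.saved_model.load(model_path)

# Save the inputs in NumPy's npy format, as mentioned above.
np.save("x_new.npy", x_new)
```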

Use the Docker option to install TF Serving; it offers high performance. Download the official TF Serving Docker image and create a Docker container from it. The container loads your model and serves it through both gRPC (on port 8500) and REST (on port 8501).
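
As a rough illustration, the Docker commands can be driven from Python with subprocess (running them directly in a terminal works just as well). The image name tensorflow/serving is the official one; the model name and the local path are placeholders.

```
import subprocess

# Pull the official TF Serving image and start a container that serves the
# exported SavedModel; the host path and model name below are placeholders.
subprocess.run(["docker", "pull", "tensorflow/serving"], check=True)
subprocess.run([
    "docker", "run", "-d", "--name", "tf_serving",
    "-p", "8500:8500",                           # gRPC
    "-p", "8501:8501",                           # REST
    "-v", "/path/to/my_model:/models/my_model",  # mount the SavedModel directory
    "-e", "MODEL_NAME=my_model",
    "tensorflow/serving",
], check=True)
```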

Now go back to Python and query this server, first using the REST API, then the gRPC API.

REST API

Start by creating the query. Note that the JSON format is 100% text-based, so the x_new NumPy array has to be converted to a Python list and then formatted as JSON. Send the input data to TensorFlow Serving with an HTTP POST request. The response is a dictionary containing a single "predictions" key, whose value is the list of predictions; convert it to a NumPy array and you have the predictions.
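
A minimal sketch of such a query, assuming the model was deployed under the placeholder name my_model and x_new is the array prepared earlier:

```
import json
import numpy as np
import requests  # third-party; pip install requests

# x_new is assumed to be the NumPy array of input images mentioned above.
input_data_json = json.dumps({
    "signature_name": "serving_default",
    "instances": x_new.tolist(),  # JSON is text-based, so convert to a plain list
})

# "my_model" is a placeholder; use the name the container serves the model under.
server_url = "http://localhost:8501/v1/models/my_model:predict"
response = requests.post(server_url, data=input_data_json)
response.raise_for_status()  # raise an exception in case of an error

y_proba = np.array(response.json()["predictions"])
```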

However, when transferring large amounts of data, the gRPC API is usually preferable to the REST API: it is based on a compact binary format and an efficient communication protocol.

gRPC API

First create the request: a PredictRequest protocol buffer that defines the model name, the signature name of the function to call, and the input data in the form of a Tensor protocol buffer. The tf.make_tensor_proto() function creates a Tensor protocol buffer based on the given tensor or NumPy array, in this case x_new. Then you can send the request to the server. After the imports, we create a gRPC communication channel to localhost on TCP port 8500, create a gRPC service stub over this channel, and use it to send the request. The call blocks until it receives the response or until the 10-second timeout expires. Finally, convert the PredictResponse protocol buffer to a tensor. You can now access your TensorFlow model remotely, using either REST or gRPC.
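
A sketch of the same query over gRPC, assuming the tensorflow-serving-api package is installed and that the model's input and output layer names are input_1 and output_1 (these names depend on your model, so check your SavedModel's signature):

```
import grpc
import tensorflow as tf
from tensorflow_serving.apis.predict_pb2 import PredictRequest
from tensorflow_serving.apis import prediction_service_pb2_grpc

# Build the request; "my_model" and the layer names are placeholders.
request = PredictRequest()
request.model_spec.name = "my_model"
request.model_spec.signature_name = "serving_default"
input_name = "input_1"  # name of the model's input layer
request.inputs[input_name].CopyFrom(tf.make_tensor_proto(x_new))

# Open a gRPC channel to the server and send the request.
channel = grpc.insecure_channel("localhost:8500")
predict_service = prediction_service_pb2_grpc.PredictionServiceStub(channel)
response = predict_service.Predict(request, timeout=10.0)  # blocks up to 10 s

# Convert the PredictResponse protocol buffer to a tensor.
output_name = "output_1"  # name of the model's output layer
y_proba = tf.make_ndarray(response.outputs[output_name])
```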

Prediction Service

You have to create a GCP console account. When you create an account, GCP automatically creates a project for you. If you want to change its automatically assigned name, go to the project settings: select IAM & admin, then Settings, change the project's name, and save. The first thing you need is Google Cloud Storage; this is where you will put the SavedModels, the training data, and so on. Scroll down to the Storage section and click Storage, then Browser. All your files will go in one or more buckets. Click Create Bucket and choose the bucket name, choose the location where you want the bucket to be hosted, and the rest of the options should be fine by default. Then click Create. Upload the model folder you created earlier to your bucket: just go to the GCS Browser, click the bucket, then drag and drop the model folder from your system into the bucket. Next you need to configure AI Platform so that it knows which models and versions you want to use. Scroll down to the Artificial Intelligence section, click AI Platform, then Models, click Create Model, fill in the model details, and click Create.
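
If you prefer to script the upload instead of dragging and dropping, here is a rough sketch using the google-cloud-storage client library; the project id, bucket name, and local folder are placeholders, and the bucket is assumed to exist already.

```
import os
from google.cloud import storage  # pip install google-cloud-storage

# Placeholders: adjust the project id, bucket name, and local export directory.
client = storage.Client(project="my-project-id")
bucket = client.bucket("my-model-bucket")

local_model_dir = "my_model/0001"  # the SavedModel folder exported earlier
for root, _, files in os.walk(local_model_dir):
    for file_name in files:
        local_path = os.path.join(root, file_name)
        # Keep the folder layout when naming the object in the bucket.
        blob = bucket.blob(local_path.replace(os.sep, "/"))
        blob.upload_from_filename(local_path)
```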

Now that you have a model on AI Platform, you need to create a model version. Select the model you created in the list of models, then click Create Version and fill in the version details: set the name, Python version, framework, ML runtime version, machine type, model path on GCS, scaling, minimum number of TF Serving containers, and so on, then save. If you select automatic scaling, AI Platform will start more TF Serving containers when the number of queries per second (QPS) increases and will load-balance the queries between them. If the QPS goes down, it will stop containers automatically.
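
The same version can also be created programmatically through the AI Platform (ML Engine) v1 API. The sketch below is only illustrative: the project id, bucket path, runtime version, and Python version are placeholders that you would need to adapt to what the service currently supports.

```
import googleapiclient.discovery  # pip install google-api-python-client

# Create a model version; all names, paths, and versions below are placeholders.
ml = googleapiclient.discovery.build("ml", "v1")
version_body = {
    "name": "v0001",
    "deploymentUri": "gs://my-model-bucket/my_model/0001",  # SavedModel path on GCS
    "runtimeVersion": "2.8",
    "pythonVersion": "3.7",
    "framework": "TENSORFLOW",
    "autoScaling": {"minNodes": 1},  # scale up with QPS, down when traffic drops
}
request = ml.projects().models().versions().create(
    parent="projects/my-project-id/models/my_model",
    body=version_body,
)
response = request.execute()
```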

To use the prediction service, you must first obtain a token: you need to configure authentication and give your application the appropriate access rights on GCP. Create a service account: go to IAM & admin, then Service accounts, click Create Service Account, fill in the form, and click Create. You must give this account some access rights, so select the ML Engine Developer role. Next click Create Key to export the service account's private key, choose JSON, and click Create; this will download the private key in the form of a JSON file. Now you can write a simple script that queries the prediction service, using the Google API Client Library and the GOOGLE_APPLICATION_CREDENTIALS environment variable. You create a resource object that wraps access to the prediction service, then write a small function that uses this resource object to call the prediction service and get the predictions back. The function takes a NumPy array containing the input images and prepares a dictionary that the client library will convert to JSON format; it then prepares a prediction request and executes it, raises an exception if an error is returned, extracts the predictions for each instance, and bundles them into a NumPy array.
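
Putting that together, a small sketch of such a script might look like this; the key file name, project id, and model id are placeholders.

```
import os
import numpy as np
import googleapiclient.discovery  # pip install google-api-python-client

# Point the client library at the service account key downloaded above
# (the file name, project id, and model id are placeholders).
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "my_service_account_key.json"

project_id = "my-project-id"
model_id = "my_model"
model_path = "projects/{}/models/{}".format(project_id, model_id)

# Resource object that wraps access to the prediction service.
ml_resource = googleapiclient.discovery.build("ml", "v1").projects()

def predict(X):
    # Prepare a dictionary that the client library converts to JSON.
    input_data_json = {
        "signature_name": "serving_default",
        "instances": X.tolist(),  # convert the NumPy array to a plain list
    }
    # Prepare and execute the prediction request.
    request = ml_resource.predict(name=model_path, body=input_data_json)
    response = request.execute()
    if "error" in response:
        raise RuntimeError(response["error"])
    # Extract the predictions and bundle them into a NumPy array.
    return np.array([pred for pred in response["predictions"]])

# Example usage, assuming x_new holds the input images:
# y_proba = predict(x_new)
```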

Thank you.
