Looking to further your data science skills? Building a data science app is a great way to learn more.
Building a data science application involves multiple steps—from data collection and preprocessing to model training and serving predictions via an API. This step-by-step tutorial will guide you through the process of creating a simple data science app.
We’ll use Python, scikit-learn, and FastAPI to train a machine learning model and build an API to serve its predictions. To keep things simple, we’ll use the built-in wine dataset from scikit-learn. Let’s get started!
▶️ You can find the code on GitHub.
Step 1: Setting Up the Environment
You should have a recent version of Python installed. Then, install the necessary libraries for building the machine learning model and the API to serve the predictions:
$ pip3 install fastapi uvicorn scikit-learn pandas
Note: Be sure to install the required libraries in a virtual environment for the project.
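If you need one, create and activate a virtual environment first, then run the pip3 install command above inside it (the environment name v1 here is just a placeholder):

$ python3 -m venv v1
$ source v1/bin/activate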
Step 2: Loading the Dataset
We will use scikit-learn’s wine dataset. Let’s load the dataset and convert it into a pandas dataframe for easy manipulation:
# model_training.py
from sklearn.datasets import load_wine
import pandas as pd

def load_wine_data():
    wine_data = load_wine()
    df = pd.DataFrame(data=wine_data.data, columns=wine_data.feature_names)
    df['target'] = wine_data.target  # Add the target (wine class)
    return df
Step 3: Exploring the Dataset
Before we proceed, it’s good practice to explore the dataset a bit.
# model_training.py

if __name__ == "__main__":
    df = load_wine_data()
    print(df.head())
    print(df.describe())
    print(df['target'].value_counts())  # Distribution of wine classes
Here, we perform a preliminary exploration of the dataset by displaying the first few rows, generating summary statistics, and checking the distribution of the output classes.
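If you want to dig a little deeper, you can also confirm the dataset's size and that there are no missing values (an optional check; the wine dataset ships clean, so no imputation is needed):

# model_training.py (optional exploration)
print(df.shape)         # (178, 14): 178 samples, 13 features plus the target
print(df.isna().sum())  # all zeros: no missing values to handle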
Step 4: Data Preprocessing
Next, we will preprocess the dataset. We split the dataset into training and test sets, and scale the features.
The preprocess_data function does just that:
# model_training.py
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def preprocess_data(df):
    X = df.drop('target', axis=1)  # Features
    y = df['target']               # Target (wine class)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=27)

    # Feature scaling
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled, y_train, y_test
Feature scaling using StandardScaler ensures that all features contribute equally to the model training.
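If you'd like to sanity-check the scaling, the scaled training features should have approximately zero mean and unit variance in every column. A quick optional check:

# model_training.py (optional check)
import numpy as np

X_train_scaled, X_test_scaled, y_train, y_test = preprocess_data(df)
print(np.round(X_train_scaled.mean(axis=0), 3))  # ~0.0 for every feature
print(np.round(X_train_scaled.std(axis=0), 3))   # ~1.0 for every feature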
Step 5: Training the Logistic Regression Model
Let’s now train a LogisticRegression model on the preprocessed data and save the model to a pickle file. The following train_model function does that:
# model_training.py
from sklearn.linear_model import LogisticRegression
import pickle

def train_model(X_train, y_train):
    model = LogisticRegression(random_state=42)
    model.fit(X_train, y_train)

    # Save the trained model using pickle
    with open('classifier.pkl', 'wb') as f:
        pickle.dump(model, f)
    return model
Step 6: Evaluating the Model
Once the model is trained, we evaluate its performance by calculating the accuracy on the test set. To do so, let’s define the function evaluate_model like so:
# model_training.py
from sklearn.metrics import accuracy_score

def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.2f}")

if __name__ == "__main__":
    df = load_wine_data()
    X_train_scaled, X_test_scaled, y_train, y_test = preprocess_data(df)
    model = train_model(X_train_scaled, y_train)
    evaluate_model(model, X_test_scaled, y_test)
When you run the Python script, the data is loaded and preprocessed, and the model is trained and evaluated. Running the script gives:
Accuracy: 0.98
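Accuracy alone can hide how well each individual class is handled. If you want a more detailed view, you can optionally print a per-class report with scikit-learn's classification_report at the end of the __main__ block:

# model_training.py (optional)
from sklearn.metrics import classification_report

y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))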
Step 7: Setting Up FastAPI
Now, we’ll set up a basic FastAPI application that will serve predictions using our trained model.
# app.py
from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def read_root():
    return {"message": "A Simple Prediction API"}
In this step, we set up a basic FastAPI application and defined a root endpoint. This creates a simple web server that can respond to HTTP requests.
You can run the FastAPI app with:
uvicorn app:app --reload
Go to http://127.0.0.1:8000 to see the message.
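You should see the JSON response from the root endpoint:

{"message":"A Simple Prediction API"}

You can also open the auto-generated interactive docs at http://127.0.0.1:8000/docs.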
Step 8: Loading the Model in FastAPI
Next, let’s define a function that loads the pre-trained logistic regression model within our FastAPI application so it can serve predictions.
# app.py
import pickle

def load_model():
    # Load the model saved by model_training.py
    with open('classifier.pkl', 'rb') as f:
        model = pickle.load(f)
    return model
This means our model is ready to make predictions when requests are received.
Step 9: Creating the Prediction Endpoint
We’ll define an endpoint to accept wine features as input and return the predicted wine class.
Define Input Data Model
We’d like to create a prediction endpoint that accepts wine feature data in JSON format. The input data model—defined using Pydantic—validates the incoming data.
# app.py
from pydantic import BaseModel

class WineFeatures(BaseModel):
    alcohol: float
    malic_acid: float
    ash: float
    alcalinity_of_ash: float
    magnesium: float
    total_phenols: float
    flavanoids: float
    nonflavanoid_phenols: float
    proanthocyanins: float
    color_intensity: float
    hue: float
    od280_od315_of_diluted_wines: float
    proline: float
Prediction Endpoint
When a request is received, the API uses the loaded model to predict the wine class based on the provided features.
# app.py
@app.post("/predict")
def predict_wine(features: WineFeatures):
    model = load_model()
    input_data = [[
        features.alcohol, features.malic_acid, features.ash, features.alcalinity_of_ash,
        features.magnesium, features.total_phenols, features.flavanoids,
        features.nonflavanoid_phenols, features.proanthocyanins, features.color_intensity,
        features.hue, features.od280_od315_of_diluted_wines, features.proline
    ]]
    prediction = model.predict(input_data)
    return {"prediction": int(prediction[0])}
Step 10: Testing the Application Locally
If it’s not already running, start the app again:
uvicorn app:app --reload
To test the application, send a POST request to the /predict endpoint with wine feature data:
curl -X POST "http://127.0.0.1:8000/predict" \
     -H "Content-Type: application/json" \
     -d '{
       "alcohol": 13.0,
       "malic_acid": 2.14,
       "ash": 2.35,
       "alcalinity_of_ash": 20.0,
       "magnesium": 120,
       "total_phenols": 3.1,
       "flavanoids": 2.6,
       "nonflavanoid_phenols": 0.29,
       "proanthocyanins": 2.29,
       "color_intensity": 5.64,
       "hue": 1.04,
       "od280_od315_of_diluted_wines": 3.92,
       "proline": 1065
     }'
Testing locally is important to ensure the API works as intended before any deployment. Here, we send a POST request to the prediction endpoint with sample wine feature data and get back the predicted class:
{"prediction":0}
Wrapping Up
We’ve built a simple yet functional data science app.
After building a machine learning model with scikit-learn, we used FastAPI to create an API that accepts user input and returns predictions. You can try building more complex models, add features, and much more.
As a next step, you can explore different datasets, models, or even deploy the application to production. Read A Practical Guide to Deploying Machine Learning Models to learn more.
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.
This post was originally published here.