Project Python Prediction
Introduction: We are going to create a Python script that trains predictive models on a dataset. First we follow a tutorial to find the best algorithm for predicting a flower's type from some of its characteristics, and then we adapt those steps to our own dataset.
Technologies used: For this project we start by working with Python and some libraries to load our dataset and run some operations on it. We then use PythonAnywhere to host the script online, together with some HTML and Flask to build a web page that lets users try our model.
I. Find the best model
1. Preparing the environment
First we need to install Python 3.8 and pip, and then the following 5 libraries:
- scipy
- numpy
- matplotlib
- pandas
- sklearn (installed from PyPI as scikit-learn)
The command: python -m pip install --user numpy scipy matplotlib pandas scikit-learn
The result of the command :

In our new Python project we need the following imports.
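The original screenshot of the imports is not available. As a minimal sketch, assuming the project follows the usual scikit-learn workflow for the six algorithms tested later, the imports could look like this (the exact list used by the authors may differ):

```python
# Data loading and plotting
from pandas import read_csv
from matplotlib import pyplot

# Splitting, cross-validation and evaluation helpers
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# The six candidate algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
```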

2. Clean and Load The Data
We use a dataset that we found on Kaggle. It contains 345 rows of penguin measurements together with their species. Because some rows have missing values, we delete them to get the cleanest dataset possible. Once that is done, we store the CSV file on GitHub so that our Python script can access it.
We load the data from:
https://raw.githubusercontent.com/AntoineDEA/DataMiningV3/master/penguins_size.csv
This dataset contains 345 penguin instances and their characteristics.
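The loading and cleaning step can be sketched as follows. This is an illustration under assumptions: `load_clean` is a hypothetical helper name, and the inline sample with its column names is made up so that the example runs without network access. If the hosted file has already been cleaned, `dropna` simply leaves it unchanged.

```python
import io

import pandas as pd

# URL given in this document
URL = "https://raw.githubusercontent.com/AntoineDEA/DataMiningV3/master/penguins_size.csv"


def load_clean(source):
    """Read a CSV (URL, path or file-like object) and drop rows with missing values."""
    df = pd.read_csv(source)
    return df.dropna().reset_index(drop=True)


# Made-up three-row sample (hypothetical columns) to show the effect of the cleaning
sample = io.StringIO(
    "species,culmen_length_mm,body_mass_g\n"
    "Adelie,39.1,3750\n"
    "Adelie,,\n"
    "Gentoo,46.1,4500\n"
)
clean = load_clean(sample)
print(clean.shape)  # (2, 3): the incomplete row was removed
```

For the real data, the call would be `dataset = load_clean(URL)`.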
3. Create a validation dataset by splitting the data

We need to know that the model we created is good, so we split the loaded dataset into two parts: 80% that we use to train, evaluate and select among our models, and 20% that we hold back as a validation dataset.
We now have training data in X_train and Y_train for preparing the models, and X_validation and Y_validation sets that we can use later.
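A minimal sketch of the 80/20 split with scikit-learn's `train_test_split` is shown below. The feature matrix here is random placeholder data, not the penguin dataset, and `random_state=1` is an assumed seed chosen only to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples, 4 numeric features, 3 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 3, size=100)

# Hold back 20% of the rows as the validation set
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1
)

print(X_train.shape, X_validation.shape)  # (80, 4) (20, 4)
```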
4. Build Models
Let’s test 6 different algorithms:
- Logistic Regression (LR)
- Linear Discriminant Analysis (LDA)
- K-Nearest Neighbors (KNN)
- Classification and Regression Trees (CART)
- Gaussian Naive Bayes (NB)
- Support Vector Machines (SVM)
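The six algorithms listed above can be built and compared along these lines. This is a sketch under assumptions: the document does not say how accuracy was estimated, so 10-fold stratified cross-validation is assumed here, the hyperparameters are illustrative defaults, and scikit-learn's bundled iris data stands in for the penguin dataset so the example runs offline.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (iris), split 80/20 as in the previous step
X, y = load_iris(return_X_y=True)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1
)

models = [
    ("LR", LogisticRegression(max_iter=1000)),
    ("LDA", LinearDiscriminantAnalysis()),
    ("KNN", KNeighborsClassifier()),
    ("CART", DecisionTreeClassifier(random_state=1)),
    ("NB", GaussianNB()),
    ("SVM", SVC(gamma="auto")),
]

# Assumed evaluation scheme: 10-fold stratified cross-validation on the training part
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
results = {}
for name, model in models:
    scores = cross_val_score(model, X_train, Y_train, cv=kfold, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Each entry of `results` holds ten accuracy values, one per fold, which is what the comparison in the next step relies on.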

6. Select best model

We now have 6 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.
In this case, Logistic Regression (LR) appears to have the largest estimated accuracy score, at about 0.98 or 98%.
A useful way to compare the samples of results for each algorithm is to create a box and whisker plot for each distribution and compare the distributions.
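Such a box and whisker comparison could be drawn as sketched below. To keep the example short, only two of the six algorithms are scored, again on the iris stand-in data with an assumed 10-fold cross-validation; the plot title and axis label are illustrative.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch also runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
models = {"LR": LogisticRegression(max_iter=1000), "KNN": KNeighborsClassifier()}

# One array of per-fold accuracies per algorithm
results = {
    name: cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    for name, model in models.items()
}

# One box per algorithm, labelled with its short name
fig, ax = plt.subplots()
boxes = ax.boxplot(list(results.values()))
ax.set_xticklabels(list(results.keys()))
ax.set_title("Algorithm comparison")
ax.set_ylabel("Accuracy")
# fig.savefig("algorithm_comparison.png")  # or plt.show() in an interactive session
```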
7. Make predictions
The results in the previous section suggest that LR was probably the most accurate model. We will use it as our final model.
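Fitting the final model and predicting on the held-out data might look like this. It is a sketch on the iris stand-in data; `max_iter=1000` is an assumed setting to help the solver converge, not a value taken from the project.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data, split 80/20 as before
X, y = load_iris(return_X_y=True)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1
)

# Train on the full training part, then predict the validation rows
model = LogisticRegression(max_iter=1000)
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)

# Compare the first few predictions with the true labels
print(predictions[:5], Y_validation[:5])
```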
8. Display predictions

We can see that the accuracy is 0.966, or about 97%, on the hold-out dataset.
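The way such an accuracy figure is obtained can be illustrated with scikit-learn's metric functions. The labels below are a tiny made-up example, not the project's results; they only show how the accuracy, confusion matrix and per-class report are computed from true and predicted labels.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up labels for illustration only
labels = ["Adelie", "Chinstrap", "Gentoo"]
y_true = ["Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo"]
y_pred = ["Adelie", "Chinstrap", "Gentoo", "Chinstrap", "Gentoo"]

# Fraction of correct predictions: 4 out of 5
accuracy = accuracy_score(y_true, y_pred)
print(accuracy)  # 0.8

# Rows are true classes, columns are predicted classes, in the order of `labels`
matrix = confusion_matrix(y_true, y_pred, labels=labels)
print(matrix)

# Precision, recall and F1 score per class
print(classification_report(y_true, y_pred, labels=labels))
```

In the project, `y_true` would be Y_validation and `y_pred` the predictions of the final model.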
II. Creating a website for user access
The objective is to deploy the Python script on a web page. To do that, we upload the Python code to the server.
At first we did not know how to combine HTML with Python code. It turns out we only need to import a framework called Flask, which lets both live in the same application. This gives code of the following kind.

This page contains an HTML form: we enter some data in the input fields and it is sent to the Python code, as can be seen below.

Once the inputs are filled in, we send the data to the Python code on this page.

The code reads the form’s data, passes it to the model, and displays the most likely species. In this case we obtained the Adélie species for these characteristics.
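The form-to-prediction flow described above can be sketched with Flask as follows. This is not the authors' page: `create_app`, the `FIELDS` names and the inline template are hypothetical, and the app accepts any fitted object with a scikit-learn style `predict` method.

```python
from flask import Flask, render_template_string, request

# Hypothetical input names; the real form may use different fields
FIELDS = ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "body_mass_g"]

# Inline template: one numeric input per field, plus the result when available
PAGE = """
<form method="post">
  {% for name in fields %}
  <label>{{ name }} <input name="{{ name }}" type="number" step="any" required></label><br>
  {% endfor %}
  <button type="submit">Predict</button>
</form>
{% if species %}<p>Predicted species: {{ species }}</p>{% endif %}
"""


def create_app(model):
    """Build a Flask app around any fitted model exposing predict()."""
    app = Flask(__name__)

    @app.route("/", methods=["GET", "POST"])
    def index():
        species = None
        if request.method == "POST":
            # Read the submitted values in a fixed order and ask the model
            values = [float(request.form[name]) for name in FIELDS]
            species = model.predict([values])[0]
        return render_template_string(PAGE, fields=FIELDS, species=species)

    return app
```

With a fitted classifier named `model`, `create_app(model).run()` would serve the page locally; on a hosting service the app object is typically handed to the provider's WSGI configuration instead.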

We decided not to add a CSS file because the host does not detect CSS files.
You can now test it yourself by clicking on the link below: