- Published on
Use Sklearn to Build Linear Regression Model
- Authors
- Name
- William Qian
- @william_z_qian
A linear regression model is possibly the first machine-learning model that most people build using scikit-learn. Here is a really simple demo.
Get Data
Of course, we have to get data for our model. In the following passage, we are going to use the dataset provided by UCI to demonstrate the topic.
Here is the description of our dataset: Dataset Description
And you can download the dataset here: Download Dataset
The data is gathered from a combined cycle power plant, which has 9568 samples and 5 features: AT (Temperature), V (Exhaust Vacuum), AP (Ambient Pressure), RH (Relative Humidity), and EP (Net hourly electrical energy output).
Define Questions
Our goal is quite simple here, we got a clear linear problem: EP is our output and the rest of the features are training features. Our model is going to be like this:
Load Data
Usually, we need to pre-process the data, however, our data here has already been cleaned.
Firstly, we need to import packages into the jupyter notebook (or directly into the .py file ).
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model, metrics
Then, we will use pandas to load data.
data = pd.read_csv('data.csv')
Check the first 5 rows of the data.
data.head()
| AT | V | AP | RH | EP | |
|---|---|---|---|---|---|
| 0 | 8.34 | 40.77 | 1010.84 | 90.01 | 480.48 |
| 1 | 23.64 | 58.49 | 1011.40 | 74.20 | 445.75 |
| 2 | 29.74 | 56.90 | 1007.15 | 41.91 | 438.76 |
| 3 | 19.07 | 49.69 | 1007.22 | 76.79 | 453.09 |
| 4 | 11.80 | 40.66 | 1017.13 | 97.20 | 464.43 |
Select Data
We may check our data's shape first.
data.shape
The result is (9568,5), which means we have 9568 samples and 5 features. However, we only need 4 for our training sample, meaning that we need to select them separately.
X = data[['AT', 'V', 'AP', 'RH']]
X.head()
| AT | V | AP | RH | |
|---|---|---|---|---|
| 0 | 8.34 | 40.77 | 1010.84 | 90.01 |
| 1 | 23.64 | 58.49 | 1011.40 | 74.20 |
| 2 | 29.74 | 56.90 | 1007.15 | 41.91 |
| 3 | 19.07 | 49.69 | 1007.22 | 76.79 |
| 4 | 11.80 | 40.66 | 1017.13 | 97.20 |
Then we select the y feature (EP).
y = data[['EP']]
y.head()
| EP | |
|---|---|
| 0 | 480.48 |
| 1 | 445.75 |
| 2 | 438.76 |
| 3 | 453.09 |
| 4 | 464.43 |
Split the Training and Testing dataset
The train_test_split function is a useful one to do the task.
The default train-test rate is 0.75:0.25
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
Use scikit-learn to Build Linear Model
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
Now print the model coefficients.
print(linreg.intercept_)
print(linreg.coef_)
[ 447.06297099]
[[-1.97376045 -0.23229086 0.0693515 -0.15806957]]
So our model is going to be like this:
Model Evaluation
For a linear model, MSE (Mean Squared Error) and RMSE (Root Mean Squared Error) are usually used for evaluation.
y_pred = linreg.predict(X_test)
print("MSE:" + str(metrics.mean_squared_error(y_test, y_pred)))
print("RMSE:" + str(np.sqrt(metrics.mean_squared_error(y_test, y_pred))))
MSE: 20.0804012021
RMSE: 4.48111606657
Visualization
This graph shows the relationship between the true and predicted values, with the points closer to the middle line y=x directly representing the lower predicted loss.
fig, ax = plt.subplots()
ax.scatter(y, predicted)
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
