Diamond Price Modelling

Regression Modelling / Deep Neural Networks / TensorFlow
Project Overview
I'm always curious about how regression models could help me in my everyday life. I was intrigued by the idea of understanding whether a diamond was priced fairly or carried a significant markup. This curiosity led me to develop a regression model that predicts the price of a diamond from its key characteristics, such as carat weight, clarity, color, and cut. The goal was to gain insight into whether a given diamond represented a good deal or not. Join me on this journey of exploring the intricate world of diamond pricing!
Data Overview
To ensure the development of an accurate model, it was crucial to access a substantial amount of diamond data. Fortunately, I discovered a dataset on Kaggle comprising approximately 54,000 rows. This dataset includes a wealth of information on diamond characteristics, such as cut, clarity, carat, color, price, and other dimensional data.
For my regression model, I opted for a Random Forest model, chosen for its seamless integration of both numerical and categorical variables, including cut, clarity, and color.

To initiate the process, I undertook data preprocessing, transforming categorical variables (Clarity, Color, Cut) into numeric values using one-hot encoding. Simultaneously, I divided the dataset into training and validation sets.
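As a rough illustration of this preprocessing step, here is a minimal sketch in Python with pandas. The tiny table is made up and only stands in for the Kaggle data. The column names, the 80/20 ratio (taken from the training description later in this write-up), and the use of `get_dummies` and `sample` are assumptions, not necessarily the exact code used in the project.

```python
import pandas as pd

# Tiny hand-made stand-in for the Kaggle diamonds data (values are illustrative only).
df = pd.DataFrame({
    "carat":   [0.23, 0.31, 0.70, 1.01, 1.50, 0.90, 0.41, 2.00, 0.55, 1.20],
    "cut":     ["Ideal", "Good", "Premium", "Fair", "Ideal",
                "Very Good", "Good", "Premium", "Ideal", "Fair"],
    "color":   ["E", "J", "G", "H", "D", "F", "I", "G", "E", "H"],
    "clarity": ["SI2", "VS1", "VS2", "SI1", "IF",
                "VVS1", "VVS2", "I1", "SI1", "VS2"],
    "price":   [326, 335, 2500, 4200, 12000, 3900, 800, 15000, 1500, 5600],
})

# One-hot encode the three categorical columns; numeric columns pass through unchanged.
encoded = pd.get_dummies(df, columns=["cut", "color", "clarity"], dtype=float)

# 80/20 split: sample 80% of the rows for training, keep the rest for validation.
train = encoded.sample(frac=0.8, random_state=0)
valid = encoded.drop(train.index)

# Separate the target (price) from the input features.
X_train, y_train = train.drop(columns="price"), train["price"]
X_valid, y_valid = valid.drop(columns="price"), valid["price"]
```

Each category level becomes its own 0/1 column (for example `cut_Ideal`), so every row has exactly one 1 among the columns generated from `cut`, one among those from `color`, and one among those from `clarity`.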
Solution Overview
Following the data pre-processing phase, I explored the correlations between the price and various diamond features. As anticipated, the price exhibited the strongest correlation with attributes defining the diamond's size, particularly carat, as well as the dimensions x, y, and z.
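A correlation check along these lines can be sketched with pandas. The numbers below are invented for illustration, and ranking features by Pearson correlation through `DataFrame.corr` is an assumption about the method, not a record of what was actually run.

```python
import pandas as pd

# Made-up numeric sample; the real dataset has roughly 54,000 rows.
df = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 1.5, 2.0],
    "depth": [61.5, 62.8, 60.9, 63.1, 61.2, 62.0],
    "x":     [4.3, 5.1, 5.7, 6.4, 7.3, 8.1],
    "price": [400, 1200, 2500, 5000, 11000, 17000],
})

# Pearson correlation of every feature with price, strongest first.
corr_with_price = (
    df.corr()["price"]
      .drop("price")              # the target correlates perfectly with itself
      .sort_values(ascending=False)
)
print(corr_with_price)
```

Sorting the resulting series makes it easy to see at a glance which features move most closely with price.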
Following the identification of feature correlations, I proceeded to normalize my data and conducted tests with four distinct models to compare their outputs. These models comprised:

1. A Linear Regression model exclusively focused on diamond carat.
2. A Linear Regression model utilizing all 27 features.
3. A Deep Neural Network (DNN) centered on diamond carat.
4. A Deep Neural Network considering all 27 features.

The architecture of my DNNs featured one input normalization layer, two dense hidden layers with 64 nodes each, utilizing the Rectified Linear Unit (ReLU) activation function, and a final dense layer with a single output neuron. Training occurred over 100 epochs with an 80/20 split between training and validation data.
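The architecture described above could be expressed in Keras roughly as follows. This is a sketch under assumptions: the loss function (mean absolute error), the Adam optimizer with a 0.001 learning rate, the helper name `build_dnn`, and the random stand-in data are all hypothetical choices that the write-up does not specify.

```python
import numpy as np
import tensorflow as tf

def build_dnn(train_features: np.ndarray) -> tf.keras.Model:
    """Normalization layer, two 64-unit ReLU hidden layers, one linear output."""
    normalizer = tf.keras.layers.Normalization(axis=-1)
    normalizer.adapt(train_features)  # learn per-feature mean and variance

    model = tf.keras.Sequential([
        normalizer,
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),  # single neuron that outputs the predicted price
    ])
    # Assumed training setup; the original loss and optimizer are not stated.
    model.compile(loss="mean_absolute_error",
                  optimizer=tf.keras.optimizers.Adam(0.001))
    return model

# Random stand-in for the 27 preprocessed features and the price target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 27)).astype("float32")
y = (X @ rng.normal(size=27)).astype("float32")

model = build_dnn(X)
# validation_split=0.2 holds out the last 20% of rows, giving an 80/20 split.
history = model.fit(X, y, validation_split=0.2, epochs=100, verbose=0)
```

The single-feature variant would use the same function with only the carat column as input, so the two DNNs differ only in what they are fed.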
Designing and training the model
My Neural Network model, trained on all features, achieved a Mean Absolute Error of about $301.

The model's performance metrics were as follows:
- R²: 0.9131
- Adjusted R²: 0.9126
- MAE: $301.3201
- MSE: 1,362,687.4 (in squared dollars)
- RMSE: $1,167.342
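For reference, the metrics listed above can be computed from true and predicted prices as in the sketch below. The function name `regression_metrics` and the sample numbers are hypothetical; the formulas are the standard definitions, with adjusted R² penalizing R² for the number of features used.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Return R², adjusted R², MAE, MSE and RMSE for a set of predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n       # average absolute miss, in dollars
    mse = sum(e * e for e in errors) / n        # average squared miss, in dollars squared
    rmse = math.sqrt(mse)                       # back in dollars

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

    return {"r2": r2, "adj_r2": adj_r2, "mae": mae, "mse": mse, "rmse": rmse}

# Illustrative prices only, not taken from the diamonds dataset.
y_true = [1000, 2000, 3000, 4000, 5000]
y_pred = [1100, 1900, 3200, 3900, 5100]
metrics = regression_metrics(y_true, y_pred, n_features=2)
print(metrics)
```

Because RMSE is the square root of MSE, the two reported figures can be cross-checked: the square root of 1,362,687.4 is about 1,167.34, which matches the RMSE above. RMSE being much larger than MAE suggests that a relatively small number of diamonds are mispriced by a wide margin.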

This endeavor marked a successful first dive into using models for practical applications in my daily life. If I were to improve on it, I'd use a dataset that includes information about the origin of each diamond. External research revealed substantial price differences between lab-grown and naturally mined diamonds, and prices of mined diamonds also vary significantly with their specific origin. Overall, this project taught me a lot about how to clean data and how to compare and contrast models to get the best outcome.
Results