Model
A model is a program that has been trained on a set of data to recognize certain patterns or make certain decisions without further human intervention.
Model = Algorithm + Training Data
Training vs Inference
AI Workflow:
Data Preparation
|--> Model Training
|--> Optimization
|--> Inference/Deployment
🦾 Model Training
- Compute- and memory-intensive
- Forward + backward pass
- Multi-GPU scaling
- Often uses NCCL, NVLink, and RDMA for multi-GPU and multi-node communication
🚀 Model Inference
The process of running unseen data through a trained AI model to make a prediction or solve a task. In short, inference is an ML model in action.
- Latency-optimized
- Forward pass only
- Often containerized (e.g., deployed on Kubernetes)
| 🦾 Training | 🚀 Inference |
|---|---|
| Model learning | Model usage |
| High compute + memory | Lower latency focus |
| Batch workloads | Real-time workloads |
| Multi-GPU scaling | Edge + cloud deployment |
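The contrast between training and inference can be sketched in a few lines of plain Python. This is a minimal illustration, not a production training loop: the toy data, the one-weight linear model, and the names `forward`, `train`, and `learning_rate` are all assumptions made for the example. Training repeats a forward pass and a backward pass (gradient computation) on every step, while inference reuses only the forward pass.

```python
# Toy data generated from y = 2x + 1 (assumed for illustration only)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def forward(w, b, x):
    """Forward pass: compute the model's prediction for one input."""
    return w * x + b


def train(xs, ys, learning_rate=0.05, epochs=2000):
    """Training: forward pass + backward pass (gradients) + parameter update."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Forward pass over the whole (tiny) dataset
        errors = [forward(w, b, x) - y for x, y in zip(xs, ys)]
        # Backward pass: gradients of the mean squared error w.r.t. w and b
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        # Gradient descent update
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Training: many forward + backward passes
w, b = train(xs, ys)

# Inference: a single forward pass on unseen input, no gradients needed
prediction = forward(w, b, 10.0)
print(f"w={w:.3f}, b={b:.3f}, prediction for x=10: {prediction:.3f}")
```

Training touches every data point thousands of times, while inference is one multiply and one add, which is why the two phases have such different compute and latency profiles.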
Quantization
The process of reducing the numerical precision of a model's weights and activations.
Reducing precision from 32-bit floating point (FP32) to 8-bit integer (INT8) can:
- Improve latency
- Save Power
- Reduce memory usage
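Precision vs Model Size vs Inference Performance
The figures below are illustrative; actual size, speed, and accuracy depend on the model and hardware.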
| Precision | Model Size | Inference Speed | Accuracy |
|---|---|---|---|
| 32-bit (FP32 / Full Precision) | 100 MB | 1x | 95% |
| 16-bit (FP16 / Half Precision) | 50 MB | 1.8x | 94.8% |
| 8-bit (INT8 Quantized) | 25 MB | 3x | 94% |
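The size reduction from quantization can be made concrete with a small sketch. The code below assumes symmetric per-tensor quantization of weights only, which is one common scheme among several; the weight values and the helper names `quantize_int8` and `dequantize` are hypothetical. Real toolchains also calibrate activations, which this sketch does not attempt.

```python
from array import array

# Hypothetical FP32 weights for a tiny layer (values chosen for illustration)
weights = [0.82, -1.27, 0.05, 0.44, -0.63, 1.10]


def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    # Fall back to a scale of 1.0 if every value is zero, to avoid dividing by zero
    scale = (max(abs(v) for v in values) / 127.0) or 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 representation."""
    return [q * scale for q in quantized]


quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Storage cost: 4 bytes per FP32 value vs 1 byte per INT8 value
fp32_bytes = len(weights) * array("f").itemsize
int8_bytes = len(quantized) * array("b").itemsize

# The rounding error per weight is at most half of one quantization step
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"FP32: {fp32_bytes} bytes, INT8: {int8_bytes} bytes")
print(f"Largest rounding error: {max_error:.4f} (scale = {scale:.4f})")
```

The 4x storage reduction is exact, while the accuracy cost comes from the rounding error, which is bounded by half of `scale` for each weight.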
EDA (Exploratory Data Analysis) for AI Models
The process of analyzing and visualizing data to understand its characteristics before training an AI model.
EDA is the first step in data analysis: it surfaces patterns, potential problems, and candidate features in the data before any of it is used to train a model.
Common techniques include:
1. N-Gram analysis
Capture longer context and relationships between words by analyzing sequences of n words (e.g., bigrams, trigrams).
- Unigram: a single word
- Bigram: two consecutive words
- Trigram: three consecutive words
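import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
# Example: count bigrams across the text column and list the five most frequent
vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_matrix = vectorizer.fit_transform(df['text_column'])  # sparse document-term matrix
counts = np.asarray(bigram_matrix.sum(axis=0)).ravel().tolist()  # total occurrences of each bigram
bigrams = sorted(zip(vectorizer.get_feature_names_out(), counts), key=lambda pair: pair[1], reverse=True)
print(bigrams[:5])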
Example output:
[('machine learning', 100), ('artificial intelligence', 80), ('deep learning', 60), ('natural language', 50), ('neural networks', 40)]
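To see what an n-gram is without any library, the idea can also be sketched in plain Python. The helper name `ngrams`, the whitespace tokenization, and the sample sentences are assumptions made for illustration; real pipelines usually apply more careful tokenization.

```python
from collections import Counter


def ngrams(text, n):
    """Return the n-word sequences in `text` (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "deep learning improves machine learning"

print(ngrams(sentence, 1))  # unigrams
print(ngrams(sentence, 2))  # bigrams
print(ngrams(sentence, 3))  # trigrams

# Count bigrams across several documents (hypothetical sample texts)
documents = [
    "machine learning is fun",
    "machine learning is useful",
    "deep learning is fun",
]
bigram_counts = Counter(bg for doc in documents for bg in ngrams(doc, 2))
print(bigram_counts.most_common(2))
```

A sentence of k words yields k - n + 1 n-grams, so larger n gives fewer but more specific sequences.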
2. Word frequency analysis
Identify the most common words in the dataset to understand prevalent themes and topics.
from collections import Counter
# Example: Count word frequencies
word_counts = Counter(" ".join(df['text_column']).split())
most_common_words = word_counts.most_common(10)
print(most_common_words)
Example output:
[('the', 500), ('and', 450), ('to', 400), ('is', 350), ('in', 300), ('it', 250), ('of', 200), ('was', 150), ('for', 100), ('with', 50)]
3. Descriptive Statistical analysis
Calculate summary statistics (e.g., mean, median, standard deviation) to understand the distribution of numerical features.
# Example: Calculate summary statistics for a numerical column
print(df['numerical_column'].describe())
Example output (truncated):
count 1000.000000
mean 50.123456
std 10.987654
min 20.000000
25% 40.000000
50% 50.000000
4. Data visualization (e.g., histograms, word clouds, scatter plots)
Use visualizations to explore data distributions and relationships between features.
import matplotlib.pyplot as plt
from wordcloud import WordCloud
# Example: Create a word cloud
text = " ".join(df['text_column'])
wordcloud = WordCloud(width=800, height=400).generate(text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Example output: A word cloud visualization showing the most common words in the dataset, with larger words representing higher frequency.
A typical workflow, from data collection through model evaluation:
1. Data Collection
- Gather relevant data for the task at hand.
- Example: For a sentiment analysis model, collect a dataset of text reviews labeled with sentiment (positive, negative, neutral).
2. Data Cleaning
- Remove duplicates, handle missing values, and correct errors in the data.
import pandas as pd
# Load dataset
df = pd.read_csv('reviews.csv')
# Remove duplicates
df = df.drop_duplicates()
# Handle missing values by forward-filling (fillna(method='ffill') is deprecated in recent pandas)
df = df.ffill()
3. Data Visualization
- Use visualizations to understand data distribution and relationships.
import matplotlib.pyplot as plt
# Visualize sentiment distribution
df['sentiment'].value_counts().plot(kind='bar')
plt.title('Sentiment Distribution')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()
4. Feature Engineering
- Create new features from existing data to improve model performance.
# Example: Create a feature for review length (in characters)
df['review_length'] = df['review_text'].str.len()
5. Data Splitting
- Split the dataset into training, validation, and test sets.
from sklearn.model_selection import train_test_split
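# Hold out 20% as the test set, then carve a validation set out of the remaining data
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=42)  # 0.25 * 0.8 = 20% of the full dataset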
6. Model Selection
- Choose an appropriate model architecture based on the task and data characteristics.
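from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
# Logistic regression cannot consume raw strings, so vectorize the text first (TF-IDF is an assumed choice)
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))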
Training Selection
- Select a training algorithm and optimization method to train the model.
- Example: Use stochastic gradient descent (SGD) to optimize the model's parameters during training.
Common training algorithms include:
| Algorithm | Description |
|---|---|
| Stochastic Gradient Descent (SGD) | Iteratively updates model parameters using the gradient computed on a random subset (mini-batch) of the training data. |
| Adam | An adaptive learning rate optimization algorithm that combines the benefits of both AdaGrad and RMSProp, using moving averages of the gradient and the squared gradient. |
| RMSProp | An optimization algorithm that adjusts the learning rate for each parameter based on a moving average of recent squared gradients. |
| AdaGrad | An optimization algorithm that adapts the learning rate for each parameter based on the accumulated sum of all past squared gradients. |
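The update rules in the table can be compared side by side on a one-dimensional toy problem. This is a sketch under stated assumptions: the objective f(x) = (x - 3)^2, the learning rates, and the step count are chosen for illustration, and the exact gradient stands in for the mini-batch gradient that makes SGD "stochastic" in real training.

```python
import math


def gradient(x):
    """Gradient of the toy objective f(x) = (x - 3)^2, minimized at x = 3."""
    return 2.0 * (x - 3.0)


def sgd(x, steps, lr=0.1):
    for _ in range(steps):
        x -= lr * gradient(x)
    return x


def adagrad(x, steps, lr=0.5, eps=1e-8):
    accumulated = 0.0  # running sum of all squared gradients
    for _ in range(steps):
        g = gradient(x)
        accumulated += g * g
        x -= lr * g / (math.sqrt(accumulated) + eps)
    return x


def rmsprop(x, steps, lr=0.01, decay=0.9, eps=1e-8):
    average = 0.0  # exponential moving average of squared gradients
    for _ in range(steps):
        g = gradient(x)
        average = decay * average + (1.0 - decay) * g * g
        x -= lr * g / (math.sqrt(average) + eps)
    return x


def adam(x, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m, v = 0.0, 0.0  # moving averages of the gradient and the squared gradient
    for t in range(1, steps + 1):
        g = gradient(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Start every optimizer from x = 0 and run the same number of steps
results = {
    "SGD": sgd(0.0, 2000),
    "AdaGrad": adagrad(0.0, 2000),
    "RMSProp": rmsprop(0.0, 2000),
    "Adam": adam(0.0, 2000),
}
for name, value in results.items():
    print(f"{name}: x = {value:.4f}")
```

All four should end close to the minimum at x = 3. The difference lies in how each scales the step: SGD uses the raw gradient, AdaGrad and RMSProp divide by a history of squared gradients, and Adam adds a moving average of the gradient itself.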
7. Model Training
- Train the model on the training data.
# Train the model
model.fit(train_df['review_text'], train_df['sentiment'])
8. Model Evaluation
- Evaluate the model's performance on the test set using appropriate metrics.
from sklearn.metrics import accuracy_score
# Predict on the test set
predictions = model.predict(test_df['review_text'])
# Calculate accuracy
accuracy = accuracy_score(test_df['sentiment'], predictions)
# Print the test-set accuracy
print(f"Accuracy: {accuracy}")
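Accuracy alone can mislead when classes are imbalanced, so "appropriate metrics" often includes precision, recall, and F1. The sketch below computes them by hand to show what they measure; the function name `classification_metrics` and the label lists are hypothetical. In practice, `sklearn.metrics` provides `precision_score`, `recall_score`, and `f1_score` for the same purpose.

```python
def classification_metrics(y_true, y_pred, positive_label):
    """Compute accuracy, plus precision/recall/F1 for one class, from label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels and predictions for an imbalanced sentiment test set
y_true = ["positive"] * 8 + ["negative"] * 2
y_pred = ["positive"] * 10  # a model that always predicts "positive"

metrics = classification_metrics(y_true, y_pred, positive_label="negative")
print(metrics)
```

Here the always-positive model scores 80% accuracy yet has zero recall on the negative class, which is the kind of failure a single accuracy number hides.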
