Image Classification with Gemini Pro

In this tutorial, you’ll learn how to use the Gemini Pro generative model with the Google AI Python SDK (software development kit) to generate code for image classification in PyTorch. We’ll delve into the effectiveness of this generated code, particularly its capability to train on popular datasets like MNIST or CIFAR-10 and achieve decent classification accuracy. Additionally, the tutorial will feature a side-by-side comparison with ChatGPT-3.5, providing valuable insights into each model’s unique code generation abilities and performance nuances.

This lesson is the 3rd in a 6-part series on Gemini Pro.

To learn how to create image classification code in PyTorch using Gemini Pro and compare its performance with ChatGPT-3.5, just keep reading.

Introduction to Gemini Pro for Image Classification

In our previous tutorial, we explored Gemini Pro, accessed through the Google AI Python SDK, with a focus on image processing. We introduced the model, analyzed the Python code it generated, and compared it with ChatGPT-3.5 and Bard. While Gemini Pro demonstrated proficiency in code generation, its code was not fully compatible with Google Colab and contained errors, including overwritten variables. ChatGPT-3.5, in contrast, showed an edge in producing error-free, Colab-compatible code.

Figure 1 shows the Google AI Studio interface using the Gemini Pro model to generate image classification code in the PyTorch framework.

Figure 1: Snapshot of Google AI Studio generating code for image classification in PyTorch using Gemini Pro (source: image by the Author).

Transitioning from Image Processing to Image Classification with Gemini Pro

Expanding from our previous exploration of image processing, we now turn our attention to image classification within the PyTorch framework using Gemini Pro. This tutorial will rigorously examine how Gemini Pro handles classifying images from renowned datasets like MNIST or CIFAR-10, available through Torchvision. We’ll delve into the model’s ability to manage training and testing, along with its effectiveness in generating vital performance metrics (e.g., True Positives, False Positives, and confusion matrices).

Comparative Analysis: Gemini Pro vs. ChatGPT-3.5 in Image Classification

In the second part of our exploration, we’ll conduct a comparative analysis between the neural networks generated by Gemini Pro and those by ChatGPT-3.5. This comparison will not only assess their innovative approaches in code generation but also evaluate which model achieves higher accuracy in image classification. Such an analysis will offer valuable insights into the capabilities and adaptability of each model in this specialized field of AI-driven image analysis.

Exploring the Variants: Gemini Pro and Gemini Pro Vision

As we know from earlier tutorials on Gemini at PyImageSearch, Google DeepMind released two Gemini variants, Gemini Pro and Gemini Pro Vision, and users can choose between them. For a deeper dive into Gemini Pro Vision, check out our PyImageSearch tutorial titled Introduction to Gemini Pro Vision. If you’re keen to understand more about Gemini Pro and its performance in image processing, see our previous tutorial, which covers its capabilities and compares it with other models.

Setting Up Gemini Pro for Generating Image Classification Code in PyTorch

Now, let’s dive into our latest blog post, where we’ll set up Gemini Pro and delve into its capabilities for image classification. We’ll walk through the code generation process for this task and also conduct a detailed comparison with ChatGPT-3.5. This will provide a clearer understanding of how these models stack up against each other in practical AI applications.

Setting Up Gemini Pro for Image Classification

As in our previous tutorial, we’ll continue using the Google AI Python SDK, which grants access to various models, including Gemini Pro.

To obtain your API key, visit Google MakerSuite and sign in with your Google account. Once logged in, you’ll enter Google AI Studio, where you can generate your API key, following the steps provided there. This key is essential for programmatically accessing the Gemini Pro model and other resources offered by the SDK.

Here, you’ll find an option to generate your API key, as illustrated in Figure 2.

Figure 2: Snapshot of Google AI Studio demonstrating API key generation (source: image by the Author).

Once you’ve generated your API key, copy it and save it securely. You will need this key every time you call the Gemini Pro model, including when you generate image classification code in this tutorial. Keeping it in a safe place ensures you have continuous access to Gemini Pro’s functionalities.

Generating PyTorch Code for Image Classification with Gemini Pro

In this section, we step into the fascinating world of AI-driven code creation. Here, we utilize the Google AI Python SDK to prompt Gemini Pro into crafting PyTorch code for image classification, setting the stage for a compelling comparison with ChatGPT-3.5’s code generation.

This part of our exploration will not only showcase Gemini Pro’s abilities but also offer a side-by-side analysis with ChatGPT-3.5, highlighting the strengths and innovative approaches of each model in handling a complex task like image classification through PyTorch.

Preparing Your Development Environment for Gemini Pro

Step 1: Installing the Google Generative AI Library

We start by installing the google-generativeai library with pip. It lets us interact with Google’s generative models, including Gemini Pro, from Python:

!pip install -q -U google-generativeai

Line 1: Installs the google-generativeai library

Step 2: Importing Essential Python Packages

import textwrap
import google.generativeai as genai
from IPython.display import Markdown

Lines 1-3: Imports three key Python packages. textwrap is employed for its text manipulation capabilities, essential for formatting. google.generativeai, abbreviated as genai, forms the core module, offering a range of AI functionalities. Lastly, IPython.display’s Markdown is included, primarily for enhancing the display of outputs within the Colab notebook. Together, these packages form the foundation for the code’s AI and display functionalities.

Step 3: Securely Configuring Your API Key

# Used to securely store your API key
from google.colab import userdata
# Or use `os.getenv('GOOGLE_API_KEY')` to fetch an environment variable.
GOOGLE_API_KEY = userdata.get('GEMINI_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

On Lines 5-9, the google.colab library’s userdata module is used to securely fetch the “GEMINI_API_KEY”, which is then stored in GOOGLE_API_KEY. An alternative method to retrieve the API key could be through os.getenv('GOOGLE_API_KEY').

The script then uses genai.configure(api_key=GOOGLE_API_KEY) to set up the GenAI library with this API key, ensuring authenticated access to its functionalities. This approach is particularly beneficial in Google Colab notebooks for securely managing API keys.
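The two retrieval options described above can be combined so that the same notebook works inside and outside Colab. The following is a minimal sketch rather than part of the original notebook: load_api_key is a hypothetical helper, and it assumes the secret is named GEMINI_API_KEY in Colab and the environment variable is named GOOGLE_API_KEY, as in the text.

```python
import os


def load_api_key(secret_name="GEMINI_API_KEY", env_name="GOOGLE_API_KEY"):
    """Return the API key from Colab secrets, or from the environment outside Colab.

    load_api_key is a hypothetical helper; the default names mirror this tutorial.
    """
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        key = os.getenv(env_name)
    else:
        key = userdata.get(secret_name)
    if not key:
        raise RuntimeError(f"No API key found in Colab secrets or ${env_name}")
    return key
```

The result can then be passed straight to genai.configure(api_key=load_api_key()), which keeps the key itself out of the notebook source.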

Creating and Configuring the Gemini Pro Model

model = genai.GenerativeModel("gemini-pro")

On Line 11, an instance of the GenerativeModel class is created using the genai library, specifically initializing it with the “gemini-pro” model. This action assigns the Gemini Pro model to the model variable, enabling its application in various AI-driven activities (e.g., text generation and data analysis). This step is crucial for leveraging Gemini Pro’s functionalities within the script.

Here, we’re opting to use the default settings of the GenerativeModel, as we’re not specifying any optional parameters (e.g., generation_config and safety_settings). This approach simplifies the setup and allows us to utilize the model’s built-in configurations.

Enhancing Code Presentation with Markdown

def to_markdown(text):
    text = text.replace("•", " *")
    return Markdown(textwrap.indent(text, "> ", predicate=lambda _: True))

Lines 13-15 introduce a to_markdown helper function, transforming a string into Markdown format, ideal for Jupyter notebooks. It starts by converting bullet points into Markdown’s asterisk syntax, followed by indenting each line with a blockquote symbol using textwrap.indent, applying this uniformly across all lines.

The final output is a Markdown object, well-suited for display in Markdown-compatible environments. This enhances the text’s presentation, making it more suitable for environments like Jupyter notebooks that support Markdown rendering.
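To see what the predicate argument contributes, the text transformation can be run on its own, without the Markdown wrapper. This is a small illustrative sketch; quote_block is a hypothetical name, and the sample string is made up for the demonstration.

```python
import textwrap


def quote_block(text):
    """Apply the same text transformation as to_markdown, minus the Markdown wrapper."""
    text = text.replace("•", " *")
    return textwrap.indent(text, "> ", predicate=lambda _: True)


sample = "• first point\n\n• second point"

# Every line, including the blank one, receives the blockquote prefix.
print(quote_block(sample))

# With the default predicate, textwrap.indent skips whitespace-only lines,
# so the blank line would split the blockquote in two.
print(textwrap.indent(sample, "> "))
```

Prefixing blank lines as well keeps the whole response inside a single blockquote when the notebook renders it.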

Generating PyTorch Code for Image Classification

response = model.generate_content(
    "Write a image multiclass classification code in pytorch framework using a public dataset"
    " I would be training and testing the image classification code in Google colab",
    stream=True
)
response.resolve()
to_markdown(response.text)

On Lines 17-23, the script employs model.generate_content to create code based on a specific prompt about writing multiclass classification code in the PyTorch framework using a public dataset intended for use in Google Colab. The stream=True setting indicates that the model’s response is streamed, and to_markdown(response.text) is used afterward to convert the text response into Markdown format. This enhances the text’s clarity and layout, making it well-suited for display in Markdown-friendly environments, like Jupyter notebooks.
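Because stream=True delivers the response in pieces, the full text only exists once every chunk has arrived. As a minimal sketch of that idea, assuming each streamed chunk exposes a text attribute as in the SDK’s streaming interface, the pieces can be joined as follows. collect_stream is a hypothetical helper, and the stand-in chunks replace a live API call so the example runs without a key.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Concatenate the .text of each streamed chunk into one string."""
    parts = []
    for chunk in chunks:
        parts.append(chunk.text)
    return "".join(parts)


# Stand-in chunks so the sketch runs without an API key. With the real SDK,
# the iterable would be the object returned by
# model.generate_content(..., stream=True).
fake_stream = [
    SimpleNamespace(text="import torch\n"),
    SimpleNamespace(text="import torchvision\n"),
]
print(collect_stream(fake_stream))
```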

# Import Necessary Libraries
import torch
import torchvision
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms, datasets

# Load the Public Dataset
train_data = datasets.ImageFolder(root='path/to/train', transform=transforms.ToTensor())
test_data = datasets.ImageFolder(root='path/to/test', transform=transforms.ToTensor())

# Initialize the DataLoaders
train_loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=32, shuffle=True)

# Define the Convolutional Neural Network (CNN) Model
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        # Convolutional Layers
        self.conv1 = nn.Conv2d(3, 6, 5)  # Input channels: 3 (RGB), Output channels: 6, Kernel size: 5×5
        self.pool = nn.MaxPool2d(2, 2)  # Kernel size: 2×2, Stride: 2
        self.conv2 = nn.Conv2d(6, 16, 5)  # Input channels: 6, Output channels: 16, Kernel size: 5×5
        self.pool = nn.MaxPool2d(2, 2)  # Kernel size: 2×2, Stride: 2

        # Fully Connected Layers
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Pass the input through the convolutional layers
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))

        # Flatten the feature maps
        x = x.view(x.size(0), -1)

        # Pass the flattened feature maps through the fully connected layers
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Initialize the CNN Model and Specify the Loss Function and Optimizer
model = CNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Train the Model
for epoch in range(20):  # Adjust the number of epochs as needed
    running_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        # Get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs)

        # Calculate the loss
        loss = criterion(outputs, labels)

        # Backward pass
        loss.backward()

        # Update the parameters
        optimizer.step()

        # Print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:  # Print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

# Test the Model
correct = 0
total = 0
with torch.no_grad():
    for data in test_loader:
        images, labels = data
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))

# Save the Trained Model
torch.save(model.state_dict(), '')

A review of the PyTorch code generated by Gemini Pro for image classification shows that it generally follows best practices and takes a structured approach. However, it has several critical issues, particularly ones that lead to code errors:

Use of Public Dataset: The code does not use a specific public dataset such as MNIST or CIFAR-10, as the prompt requires. This is a significant oversight; loading one of these datasets with Torchvision’s dataset utilities would bring the script in line with the prompt.

Missing import torch.nn.functional as F: The F.relu function is used in the forward method, but import torch.nn.functional as F is not included at the beginning of the script.

Dataset Path Specifications: Incorrect dataset paths will prevent the model from training and testing properly. Be sure to replace 'path/to/train' and 'path/to/test' with the actual paths to your train and test datasets.
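One way to address the first two points is sketched below. It is an assumption about what such a fix could look like, not the exact corrected script behind the results that follow: make_cifar10_loaders is a hypothetical helper, the plain ToTensor transform and batch size of 32 are carried over from the generated code, and root="./data" is an assumed download directory.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F  # the import missing from the generated script


def make_cifar10_loaders(root="./data", batch_size=32):
    """Build CIFAR-10 train/test loaders (downloads the dataset on first use)."""
    # Imported lazily so the model below can be checked without torchvision.
    from torchvision import datasets, transforms

    transform = transforms.ToTensor()
    train_data = datasets.CIFAR10(root=root, train=True, download=True, transform=transform)
    test_data = datasets.CIFAR10(root=root, train=False, download=True, transform=transform)
    train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True)
    test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, shuffle=False)
    return train_loader, test_loader


class CNN(nn.Module):
    """Same layers as the generated network."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)


# Shape check on a random CIFAR-10-sized batch; no dataset download needed.
logits = CNN()(torch.randn(4, 3, 32, 32))
print(logits.shape)
```

Since CIFAR-10 is downloaded by Torchvision, the placeholder paths from the generated script are no longer needed with this approach.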

Other points for improvement, while important, are less likely to cause immediate functional errors:

Duplicate Pooling Layer: The self.pool layer is defined twice in the CNN class. This doesn’t cause a functional error (the second assignment simply overwrites the first with an identical layer), but it’s redundant and can be defined just once.

Output Layer Dimension: The output layer self.fc3 in the CNN class has 10 neurons, which implies that the model is designed for a dataset with 10 classes. Ensure this aligns with the number of classes in your dataset (e.g., MNIST or CIFAR-10).

Flattening Operation: The flattening operation in the forward method (x = x.view(x.size(0), -1)) adapts to any feature-map size, but self.fc1 expects exactly 16 * 5 * 5 inputs. Be sure that this size matches the output of the last convolutional layer for your input resolution.

Print Statement in Training Loop: The conditional if i % 2000 == 1999 in the training loop might never be reached, depending on the size of your dataset and batch size. Adjust this condition to suit the number of batches in your training data.

Testing Accuracy Print Statement: The message 'Accuracy of the network on the 10000 test images: %d %%' assumes there are 10,000 test images. This should be modified to reflect the actual size of your test dataset.
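Two of these checks are plain arithmetic and can be verified without running the network. The sketch below assumes 32×32 inputs for CIFAR-10 and 28×28 for MNIST, CIFAR-10’s 50,000 training images, and the batch size of 32 from the generated code; the helper names are hypothetical.

```python
import math


def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (square input, floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(size):
    """Feature count entering fc1 for the generated CNN: conv5 -> pool2 -> conv5 -> pool2."""
    size = conv_out(size, kernel=5)            # conv1
    size = conv_out(size, kernel=2, stride=2)  # pool
    size = conv_out(size, kernel=5)            # conv2
    size = conv_out(size, kernel=2, stride=2)  # pool
    return 16 * size * size                    # 16 output channels from conv2


print(flattened_features(32))  # CIFAR-10: 16 * 5 * 5 = 400, matches fc1
print(flattened_features(28))  # MNIST: 16 * 4 * 4 = 256, fc1 would need changing

# Batches per epoch on 50,000 training images with batch_size=32.
batches_per_epoch = math.ceil(50000 / 32)
print(batches_per_epoch)  # 1563, so `i % 2000 == 1999` never fires
```

Under these assumptions, the fully connected layer is sized correctly for CIFAR-10, but the training loop would print no loss values at all, which is why the reporting condition needs adjusting.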

Addressing these areas is essential to enhance the model’s accuracy and functionality.

Downloading to ./data/cifar-10-python.tar.gz
100%|██████████| 170498071/170498071 [00:05<00:00, 29302880.52it/s]
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
Epoch 1, Loss: 2.213340243262708
Epoch 2, Loss: 1.614285399999155
Epoch 3, Loss: 1.4277218132067824
Epoch 4, Loss: 1.3255487715496737
Epoch 5, Loss: 1.2489837502579555
Epoch 6, Loss: 1.1976356018534706
Epoch 7, Loss: 1.1457488671745486
Epoch 8, Loss: 1.097599425660375
Epoch 9, Loss: 1.0516416126352441
Epoch 10, Loss: 1.0136031122768627
Accuracy of the network on the 10000 test images: 57 %

Running the corrected PyTorch code on the CIFAR-10 dataset, with the necessary import statement added, produces the output above, which indicates a successful training and testing process.

The logs indicate that the CIFAR-10 dataset was successfully downloaded, extracted, and verified. This process involves retrieving the dataset from an online source, saving it locally, and then extracting the contents for use in the model training and testing. The confirmation of the files being “already downloaded and verified” suggests that the dataset was set up correctly for the subsequent steps in the code.

The model’s training loss decreased progressively over 10 epochs, indicating improvement in learning. However, the final accuracy on the test set was 57%, suggesting that while the model learned from the data, its predictive performance was poor.

The low 57% accuracy of the network on the CIFAR-10 dataset could primarily be attributed to the model’s architecture. While factors like hyperparameter settings also play a role, the architecture is a key determinant of how well the model can learn and generalize from the training data. Adjusting the model’s structure may be necessary for significant improvements in accuracy.

Exploring the Differences in Code Generation for Image Classification Between ChatGPT-3.5 and Gemini Pro

In this section, we shift gears to examine how ChatGPT-3.5 handles the same image classification task. Here, we provide ChatGPT-3.5 with the identical prompt given to Gemini Pro and observe its approach to generating PyTorch code for image classification. The focus is not only on the code generation process but also on how well the model performs on test data. This comparison aims to highlight the differences in efficiency, effectiveness, and accuracy between the two AI models in tackling a similar challenge.

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Check if a GPU is available, if not, use CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Define transformations for data augmentation and normalization
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

# Load CIFAR-10 dataset
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform_train)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform_test)
testloader = torch.utils.data.DataLoader(testset, batch_size=64,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(128 * 8 * 8, 512)
        self.fc2 = nn.Linear(512, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = x.view(-1, 128 * 8 * 8)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

net = Net()
net.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

# Training loop
for epoch in range(10):  # You can adjust the number of epochs as needed
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()

        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print(f"Epoch {epoch+1}, Loss: {running_loss / len(trainloader)}")

print("Training finished")

all_labels = []
all_predictions = []

with torch.no_grad():
    for data in testloader:
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)

        outputs = net(inputs)
        _, predicted = torch.max(outputs, 1)

        all_labels.extend(labels.cpu().numpy())
        all_predictions.extend(predicted.cpu().numpy())

accuracy = accuracy_score(all_labels, all_predictions)
confusion = confusion_matrix(all_labels, all_predictions)
classification_rep = classification_report(all_labels, all_predictions, target_names=classes)

print(f"Accuracy: {accuracy * 100:.2f}%")
print("Confusion Matrix:")
print(confusion)
print("Classification Report:")
print(classification_rep)

The code generated by ChatGPT-3.5 for image classification in PyTorch appears comprehensive and well-structured. Let’s do a detailed review.

Setting Up Your Environment and Data for Image Classification

Library Import and Device Setup: Correctly imports necessary libraries and optimizes for GPU usage.

Data Augmentation and Normalization: Implements effective techniques for training and testing datasets.

Dataset Loading and DataLoader Initialization: Accurately loads and prepares the CIFAR-10 dataset for both training and testing.

Training Models and Understanding the Architecture

Neural Network Architecture: Features a more complex CNN architecture than Gemini Pro, potentially enhancing learning.

Training Loop: Well-structured with loss calculation, optimizer steps, and torch.relu for activation.

GPU Utilization: Efficiently uses GPU for training and testing, boosting performance.
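The difference in architectural complexity can be made concrete by counting trainable parameters from the layer definitions in the two listings. This is a rough sketch with hypothetical helper names; parameter count is only a coarse proxy for capacity and does not by itself explain the accuracy gap.

```python
def conv2d_params(in_ch, out_ch, kernel):
    """Weights plus biases of a square-kernel Conv2d layer."""
    return (in_ch * kernel * kernel + 1) * out_ch


def linear_params(in_features, out_features):
    """Weights plus biases of a Linear layer."""
    return (in_features + 1) * out_features


# Layer sizes copied from the Gemini Pro listing (class CNN).
gemini_cnn = (
    conv2d_params(3, 6, 5) + conv2d_params(6, 16, 5)
    + linear_params(16 * 5 * 5, 120) + linear_params(120, 84) + linear_params(84, 10)
)

# Layer sizes copied from the ChatGPT-3.5 listing (class Net).
chatgpt_net = (
    conv2d_params(3, 64, 3) + conv2d_params(64, 128, 3)
    + linear_params(128 * 8 * 8, 512) + linear_params(512, 10)
)

print(gemini_cnn)   # 62006
print(chatgpt_net)  # 4275594
```

By this count, the ChatGPT-3.5 network has roughly 69 times as many parameters, most of them in its first fully connected layer.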

Evaluating Model Performance and Analyzing Results

Evaluation Metrics: Evaluates the model with accuracy, confusion matrix, and classification report.

Final Performance Metrics: Offers a detailed analysis of model performance with accuracy, confusion matrix, and classification report.

Detailed Comparison: ChatGPT-3.5 vs. Gemini Pro for Image Classification

Public Dataset Usage: Unlike Gemini Pro, ChatGPT-3.5’s code correctly incorporates the CIFAR-10 dataset.

Data Augmentation: ChatGPT-3.5 includes data augmentation, which is absent in Gemini Pro’s code.

Complex Network Architecture: ChatGPT-3.5’s network is more intricate, suggesting improved learning capabilities.

Detailed Performance Metrics: Provides a more comprehensive performance evaluation than Gemini Pro.

Overall, ChatGPT-3.5’s approach to image classification showcases a well-rounded and potentially more effective solution than Gemini Pro, particularly in terms of dataset handling, model complexity, and depth of performance analysis.

Downloading to ./data/cifar-10-python.tar.gz
100%|██████████| 170498071/170498071 [00:03<00:00, 43389823.48it/s]
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
Epoch 1, Loss: 1.656345052792288
Epoch 2, Loss: 1.2884532127081585
Epoch 3, Loss: 1.0894466095873157
Epoch 4, Loss: 0.9669426654458351
Epoch 5, Loss: 0.8824315952218097
Epoch 6, Loss: 0.8171949713583797
Epoch 7, Loss: 0.7658578648667811
Epoch 8, Loss: 0.7242285415644536
Epoch 9, Loss: 0.6837977789856894
Epoch 10, Loss: 0.6462546251618954
Training finished
Accuracy: 77.29%
Confusion Matrix:
[[796   9  51  27   9  19  13   5  41  30]
 [ 14 875   6  12   2   5   9   1  19  57]
 [ 46   2 653  52  60  77  80  21   3   6]
 [ 16   5  45 564  34 227  92   8   2   7]
 [ 13   1  50  54 726  43  69  41   3   0]
 [  8   2  20 131  31 754  29  22   1   2]
 [  4   1  28  43  19  30 873   0   2   0]
 [  6   0  33  32  40  90   9 786   0   4]
 [ 50  16  11  15   4  17   7   5 857  18]
 [ 31  51   5  11   4  12   9  11  21 845]]
Classification Report:
              precision    recall  f1-score   support

       plane       0.81      0.80      0.80      1000
         car       0.91      0.88      0.89      1000
        bird       0.72      0.65      0.69      1000
         cat       0.60      0.56      0.58      1000
        deer       0.78      0.73      0.75      1000
         dog       0.59      0.75      0.66      1000
        frog       0.73      0.87      0.80      1000
       horse       0.87      0.79      0.83      1000
        ship       0.90      0.86      0.88      1000
       truck       0.87      0.84      0.86      1000

    accuracy                           0.77     10000
   macro avg       0.78      0.77      0.77     10000
weighted avg       0.78      0.77      0.77     10000

The initial results from running the ChatGPT-3.5 generated code show a well-managed process, with the CIFAR-10 dataset being downloaded, extracted, and validated accurately. The training demonstrated a consistent reduction in loss across 10 epochs, indicating effective learning.

In terms of evaluation, the model attained a notable 77.29% accuracy on the test set, which is considerably higher than Gemini Pro’s achievement of 57% accuracy. This significant difference underscores the effectiveness of ChatGPT-3.5’s approach. Additionally, the code included comprehensive evaluation metrics (e.g., confusion matrix and classification report), offering an in-depth understanding of the model’s performance across various classes.
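The figures in the classification report follow directly from the confusion matrix. As a sketch of that relationship, the code below recomputes accuracy, precision, and recall from the matrix printed above, assuming scikit-learn’s convention that rows are true labels and columns are predictions; per_class_metrics is a hypothetical helper.

```python
confusion = [
    [796, 9, 51, 27, 9, 19, 13, 5, 41, 30],
    [14, 875, 6, 12, 2, 5, 9, 1, 19, 57],
    [46, 2, 653, 52, 60, 77, 80, 21, 3, 6],
    [16, 5, 45, 564, 34, 227, 92, 8, 2, 7],
    [13, 1, 50, 54, 726, 43, 69, 41, 3, 0],
    [8, 2, 20, 131, 31, 754, 29, 22, 1, 2],
    [4, 1, 28, 43, 19, 30, 873, 0, 2, 0],
    [6, 0, 33, 32, 40, 90, 9, 786, 0, 4],
    [50, 16, 11, 15, 4, 17, 7, 5, 857, 18],
    [31, 51, 5, 11, 4, 12, 9, 11, 21, 845],
]
classes = ("plane", "car", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck")


def per_class_metrics(matrix):
    """Return overall accuracy and {class index: (precision, recall)}."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    metrics = {}
    for i in range(n):
        tp = matrix[i][i]                                # true positives
        fp = sum(matrix[r][i] for r in range(n)) - tp    # other classes predicted as i
        fn = sum(matrix[i]) - tp                         # class i predicted as something else
        metrics[i] = (tp / (tp + fp), tp / (tp + fn))
    return correct / total, metrics


accuracy, metrics = per_class_metrics(confusion)
print(f"accuracy: {accuracy:.4f}")
for i, name in enumerate(classes):
    precision, recall = metrics[i]
    print(f"{name:>5}: precision {precision:.2f}, recall {recall:.2f}")
```

The off-diagonal entries also show where the model struggles most: 227 cat images were predicted as dog and 131 dog images as cat, which accounts for those two classes having the lowest precision and recall.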

It’s noteworthy that the code from ChatGPT-3.5 was executed without any human modifications, demonstrating its robustness and reliability. This contrasts with the Gemini Pro code, which required specific fixes, such as correcting an import error, adding the CIFAR-10 dataset, and modifying epoch reporting for it to function correctly. This comparison underscores ChatGPT-3.5’s proficiency in generating ready-to-use, reliable code for complex tasks like image classification.

Summary and Key Takeaways

This comprehensive post explores image classification using Gemini Pro and its comparison with ChatGPT-3.5. Initially, it covers the setup of Gemini Pro and then delves into the use of Gemini Pro for image classification, revealing its limitations and the need for code adjustments. Gemini Pro’s generated code, while fundamentally sound, required modifications for integrating the CIFAR-10 dataset, fixing import errors, and correcting print statements in training loops. These adjustments were essential for the model to achieve a moderate 57% accuracy rate.

In contrast, ChatGPT-3.5’s code for the same task demonstrated its robustness by requiring no alterations and achieving a higher accuracy rate of 77.29%. This notable difference in performance and the readiness of the code highlight ChatGPT-3.5’s advanced capabilities in creating efficient, accurate code for complex AI tasks, marking an area for improvement in Gemini Pro’s code generation process.

Citation Information

Sharma, A. “Image Classification with Gemini Pro,” PyImageSearch, P. Chugh, A. R. Gosthipaty, S. Huot, K. Kidriavsteva, and R. Raha, eds., 2024,

author = {Aditya Sharma},
title = {Image Classification with Gemini Pro},
booktitle = {PyImageSearch},
editor = {Puneet Chugh and Aritra Roy Gosthipaty and Susan Huot and Kseniia Kidriavsteva and Ritwik Raha},
year = {2024},
url = {},

