Comparing computer vision models

Newcomers to deep learning are often told that pre-trained computer vision models can generally be treated as ‘black boxes’, without understanding their inner workings. On this page, we compare the performance of some state-of-the-art computer vision models. In doing so, we build a better mental map of what cutting-edge performance looks like, and we demonstrate some of the visualization tools of data analysis.

We use state-of-the-art computer vision models from PyTorch Image Models (timm). Our benchmarks for these models were collected by Ross Wightman and come from his GitHub repository. Our analysis is based on Jeremy Howard’s original analysis.

Benchmarks

We download the benchmarks, in this case CSV files, from Ross Wightman’s GitHub repository.

import pandas as pd

# Clone Ross Wightman's repo (a shallow clone, for speed) and move into the results
! git clone --depth 1 https://github.com/rwightman/pytorch-image-models.git
%cd pytorch-image-models/results

df_results = pd.read_csv('results-imagenet.csv')

def get_data(part, col):
    # Merge the benchmark CSV with the ImageNet accuracy results on the 'model' column
    df = pd.read_csv(f'benchmark-{part}-amp-nhwc-pt111-cu113-rtx3090.csv').merge(df_results, on='model')
    # Seconds per sample is the reciprocal of samples per second
    df['secs'] = 1. / df[col]
    # Extract the architecture family from the model name, e.g. 'levit_128s' -> 'levit'
    df['family'] = df.model.str.extract(r'^([a-z]+?(?:v2)?)(?:\d|_|$)')
    # Drop the GroupNorm variants
    df = df[~df.model.str.endswith('gn')]
    # Treat models pre-trained on ImageNet-22k as their own families
    df.loc[df.model.str.contains('in22'),'family'] = df.loc[df.model.str.contains('in22'),'family'] + '_in22'
    # Likewise for the ResNet-D variants
    df.loc[df.model.str.contains('resnet.*d'),'family'] = df.loc[df.model.str.contains('resnet.*d'),'family'] + 'd'
    return df[df.family.str.contains('^re[sg]netd?|beit|convnext|levit|efficient|vit|vgg|swin')]
Cloning into 'pytorch-image-models'...
remote: Enumerating objects: 555, done.
remote: Counting objects: 100% (555/555), done.
remote: Compressing objects: 100% (395/395), done.
remote: Total 555 (delta 218), reused 323 (delta 154), pack-reused 0
Receiving objects: 100% (555/555), 2.39 MiB | 12.89 MiB/s, done.
Resolving deltas: 100% (218/218), done.
/kaggle/working/pytorch-image-models/results
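
The family-extraction regex is the least obvious step above, so here is a quick sanity check (a hypothetical snippet, not part of the original notebook) of what it pulls out of a few model names:

import pandas as pd

# Representative model names and the family each should map to
names = pd.Series(['levit_128s', 'resnetv2_50', 'regnetx_002', 'convnext_tiny'])
print(names.str.extract(r'^([a-z]+?(?:v2)?)(?:\d|_|$)')[0].tolist())
# -> ['levit', 'resnetv2', 'regnetx', 'convnext']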

We take a look at the first few rows of the inference benchmark data.

df = get_data('infer', 'infer_samples_per_sec')
df.head()
model infer_samples_per_sec infer_step_time infer_batch_size infer_img_size param_count_x top1 top1_err top5 top5_err param_count_y img_size crop_pct interpolation secs family
0 levit_128s 21485.80 47.648 1024 224 7.78 76.530 23.470 92.866 7.134 7.78 224 0.900 bicubic 0.000047 levit
1 regnetx_002 17821.98 57.446 1024 224 2.68 68.762 31.238 88.556 11.444 2.68 224 0.875 bicubic 0.000056 regnetx
2 regnety_002 16673.08 61.405 1024 224 3.16 70.252 29.748 89.540 10.460 3.16 224 0.875 bicubic 0.000060 regnety
3 levit_128 14657.83 69.849 1024 224 9.21 78.486 21.514 94.010 5.990 9.21 224 0.900 bicubic 0.000068 levit
4 regnetx_004 14440.03 70.903 1024 224 5.16 72.396 27.604 90.830 9.170 5.16 224 0.875 bicubic 0.000069 regnetx
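
Since the family labels drive the grouping and coloring in every chart below, it can be worth checking which families survived the filtering; a minimal check (not part of the original notebook) would be:

# Count how many models fall into each architecture family
print(df.family.value_counts())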

Inference model performance

We plot the benchmarks for the inference models. In our chart:
- the x axis shows how many seconds it takes to process one image, on a log scale
- the y axis is the accuracy on ImageNet
- the size of each bubble is proportional to the size of the images used in testing
- the color shows which “family” the architecture is from.

The chart is interactive:
- hover your mouse over a marker to see details about the model
- double-click in the legend to display just one family
- single-click in the legend to show or hide a family.

import plotly.express as px
w,h = 1000,800

def show_all(df, title, size):
    # Square the size column so differences in test image size stand out
    return px.scatter(df, width=w, height=h, size=df[size]**2, title=title,
        x='secs',  y='top1', log_x=True, color='family', hover_name='model', hover_data=[size])
show_all(df, 'Inference', 'infer_img_size')

We can easily restrict to a subset of the models to get a simpler plot. We distinguish the convnext models pre-trained on the 22,000-category version of ImageNet (convnext_in22) from those that were not (convnext).

subset = 'levit|resnetd?|regnety|vgg|swin|convnext.*'

This function overlays, for each family, an ordinary least squares fit of accuracy against log(seconds).

def show_subs(df, title, size):
    # Keep only the families named in `subset`
    df_subs = df[df.family.str.fullmatch(subset)]
    return px.scatter(df_subs, width=w, height=h, size=df_subs[size]**2, title=title,
        trendline="ols", trendline_options={'log_x':True},
        x='secs',  y='top1', log_x=True, color='family', hover_name='model', hover_data=[size])
show_subs(df, 'Inference', 'infer_img_size')
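
To make the trendline concrete, here is a minimal numpy sketch (hypothetical, not from the original analysis) of an analogous fit for a single family: an ordinary least squares line of top-1 accuracy against log10(secs).

import numpy as np

# Fit top1 ≈ intercept + slope * log10(secs) for the levit family,
# analogous to what trendline="ols" with {'log_x': True} does per family
levit = df[df.family == 'levit']
slope, intercept = np.polyfit(np.log10(levit.secs), levit.top1, 1)
print(f'top1 ≈ {intercept:.1f} + {slope:.2f} * log10(secs)')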

Commentary on findings

The LeViT family of models is both fast and accurate. These models are apparently constructed as a hybrid of convolutional neural networks and transformers.

The Swin family of transformers is among the most accurate. Its authors describe it as a “hierarchical Transformer whose representation is computed with shifted windows.”

Speed vs parameter count

We finally compare speed against parameter count. Papers often use parameter count as a proxy for speed. However, as we see, there is wide variation in speed at each level of parameter count, so it is not a reliable proxy. Parameter count sometimes correlates with memory requirements as well, but even that relationship is loose. In the following chart:
- the x axis shows the parameter count, on a log scale
- the y axis shows the speed in seconds per image, on a log scale.

px.scatter(df, width=w, height=h,
    x='param_count_x',  y='secs', log_x=True, log_y=True, color='infer_img_size',
    hover_name='model', hover_data=['infer_samples_per_sec', 'family']
)
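
To quantify how loose the parameter-count proxy is, a small sketch (hypothetical, not in the original analysis) can bucket models by parameter count and measure the spread of speeds within each bucket:

import numpy as np
import pandas as pd

# Bucket models by (log) parameter count; a large max/min speed ratio
# within a bucket means parameter count predicts speed poorly there
buckets = pd.cut(np.log10(df.param_count_x), bins=5)
spread = df.groupby(buckets).secs.agg(['min', 'max'])
print(spread.assign(ratio=spread['max'] / spread['min']))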

Training model performance

We’ll now replicate the above analysis for training performance. First we grab the data:

tdf = get_data('train', 'train_samples_per_sec')
tdf.head()
model train_samples_per_sec train_step_time train_batch_size train_img_size param_count_x top1 top1_err top5 top5_err param_count_y img_size crop_pct interpolation secs family
0 levit_128s 6303.14 80.293 512 224 7.78 76.530 23.470 92.866 7.134 7.78 224 0.900 bicubic 0.000159 levit
1 levit_128 4434.56 114.332 512 224 9.21 78.486 21.514 94.010 5.990 9.21 224 0.900 bicubic 0.000226 levit
3 levit_192 3823.94 132.765 512 224 10.95 79.842 20.158 94.786 5.214 10.95 224 0.900 bicubic 0.000262 levit
4 resnet18 3584.19 142.504 512 224 11.69 69.748 30.252 89.078 10.922 11.69 224 0.875 bilinear 0.000279 resnet
9 levit_256 2923.52 174.041 512 224 18.89 81.510 18.490 95.490 4.510 18.89 224 0.900 bicubic 0.000342 levit

Now we can repeat the same family plot we did above:

show_all(tdf, 'Training', 'train_img_size')

And we also look at a subset of models:

show_subs(tdf, 'Training', 'train_img_size')
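
Since both benchmark sets are now loaded, one can also relate them directly; a small sketch (a hypothetical extension, not part of the original analysis) joins them on model name to see how much more expensive a training step is than an inference step:

# Join inference and training benchmarks and compare seconds per sample
speed = df[['model', 'secs']].merge(tdf[['model', 'secs']],
                                    on='model', suffixes=('_infer', '_train'))
speed['train_over_infer'] = speed.secs_train / speed.secs_infer
print(speed.sort_values('train_over_infer', ascending=False).head())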