Comparing computer vision models

Newcomers to deep learning are often told that pre-trained computer vision models can generally be treated as ‘black boxes’, without understanding their inner workings. On this page, we compare the performance of some state-of-the-art computer vision models. In doing so, we build a better mental map of what cutting-edge performance looks like, and we demonstrate some of the visualization tools of data analysis.

We use state-of-the-art computer vision models from PyTorch Image Models (timm). Our benchmarks for these models were collected by Ross Wightman and come from his GitHub repository. Our analysis is based on Jeremy Howard’s original analysis.

Benchmarks

We download the benchmarks, in this case CSV files, from Ross Wightman’s GitHub repository.

import pandas as pd

# Clone Ross Wightman's repo (a shallow clone, for speed) and move into the results
! git clone --depth 1 https://github.com/rwightman/pytorch-image-models.git
%cd pytorch-image-models/results

df_results = pd.read_csv('results-imagenet.csv')

def get_data(part, col):
    # Merge the benchmark CSV with the ImageNet accuracy results on the 'model' column
    df = pd.read_csv(f'benchmark-{part}-amp-nhwc-pt111-cu113-rtx3090.csv').merge(df_results, on='model')
    # Seconds per sample is the reciprocal of samples per second
    df['secs'] = 1. / df[col]
    # Extract the architecture family from the model name, e.g. 'levit_128s' -> 'levit'
    df['family'] = df.model.str.extract(r'^([a-z]+?(?:v2)?)(?:\d|_|$)')
    # Drop the GroupNorm variants
    df = df[~df.model.str.endswith('gn')]
    # Treat models pre-trained on ImageNet-22k as their own families
    df.loc[df.model.str.contains('in22'),'family'] = df.loc[df.model.str.contains('in22'),'family'] + '_in22'
    # Likewise for the ResNet-D variants
    df.loc[df.model.str.contains('resnet.*d'),'family'] = df.loc[df.model.str.contains('resnet.*d'),'family'] + 'd'
    return df[df.family.str.contains('^re[sg]netd?|beit|convnext|levit|efficient|vit|vgg|swin')]
Cloning into 'pytorch-image-models'...
remote: Enumerating objects: 555, done.
remote: Counting objects: 100% (555/555), done.
remote: Compressing objects: 100% (395/395), done.
remote: Total 555 (delta 218), reused 323 (delta 154), pack-reused 0
Receiving objects: 100% (555/555), 2.39 MiB | 12.89 MiB/s, done.
Resolving deltas: 100% (218/218), done.
/kaggle/working/pytorch-image-models/results
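
The family-extraction regex is the least obvious step above, so here is a quick sanity check (a hypothetical snippet, not part of the original notebook) of what it pulls out of a few model names:

import pandas as pd

# Representative model names and the family each should map to
names = pd.Series(['levit_128s', 'resnetv2_50', 'regnetx_002', 'convnext_tiny'])
print(names.str.extract(r'^([a-z]+?(?:v2)?)(?:\d|_|$)')[0].tolist())
# -> ['levit', 'resnetv2', 'regnetx', 'convnext']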

We take a look at the first few rows of the inference benchmark data.

df = get_data('infer', 'infer_samples_per_sec')
df.head()
model infer_samples_per_sec infer_step_time infer_batch_size infer_img_size param_count_x top1 top1_err top5 top5_err param_count_y img_size crop_pct interpolation secs family
0 levit_128s 21485.80 47.648 1024 224 7.78 76.530 23.470 92.866 7.134 7.78 224 0.900 bicubic 0.000047 levit
1 regnetx_002 17821.98 57.446 1024 224 2.68 68.762 31.238 88.556 11.444 2.68 224 0.875 bicubic 0.000056 regnetx
2 regnety_002 16673.08 61.405 1024 224 3.16 70.252 29.748 89.540 10.460 3.16 224 0.875 bicubic 0.000060 regnety
3 levit_128 14657.83 69.849 1024 224 9.21 78.486 21.514 94.010 5.990 9.21 224 0.900 bicubic 0.000068 levit
4 regnetx_004 14440.03 70.903 1024 224 5.16 72.396 27.604 90.830 9.170 5.16 224 0.875 bicubic 0.000069 regnetx
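
Since the family labels drive the grouping and coloring in every chart below, it can be worth checking which families survived the filtering; a minimal check (not part of the original notebook) would be:

# Count how many models fall into each architecture family
print(df.family.value_counts())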

Inference model performance

We plot the benchmarks for the inference models. In our chart:
- the x axis shows how many seconds it takes to process one image, on a log scale
- the y axis is the accuracy on ImageNet
- the size of each bubble is proportional to the size of the images used in testing
- the color shows which “family” the architecture is from.

The chart is interactive:
- hover your mouse over a marker to see details about the model
- double-click in the legend to display just one family
- single-click in the legend to show or hide a family.

import plotly.express as px
w,h = 1000,800

def show_all(df, title, size):
    # Square the size column so differences in test image size stand out
    return px.scatter(df, width=w, height=h, size=df[size]**2, title=title,
        x='secs',  y='top1', log_x=True, color='family', hover_name='model', hover_data=[size])
show_all(df, 'Inference', 'infer_img_size')

We can easily restrict to a subset of the models to get a simpler plot. We distinguish the convnext models pre-trained on the 22,000-category version of ImageNet (convnext_in22) from those that were not (convnext).

subset = 'levit|resnetd?|regnety|vgg|swin|convnext.*'

This function overlays, for each family, an ordinary least squares fit of accuracy against log(seconds).

def show_subs(df, title, size):
    # Keep only the families named in `subset`
    df_subs = df[df.family.str.fullmatch(subset)]
    return px.scatter(df_subs, width=w, height=h, size=df_subs[size]**2, title=title,
        trendline="ols", trendline_options={'log_x':True},
        x='secs',  y='top1', log_x=True, color='family', hover_name='model', hover_data=[size])
show_subs(df, 'Inference', 'infer_img_size')
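
To make the trendline concrete, here is a minimal numpy sketch (hypothetical, not from the original analysis) of an analogous fit for a single family: an ordinary least squares line of top-1 accuracy against log10(secs).

import numpy as np

# Fit top1 ≈ intercept + slope * log10(secs) for the levit family,
# analogous to what trendline="ols" with {'log_x': True} does per family
levit = df[df.family == 'levit']
slope, intercept = np.polyfit(np.log10(levit.secs), levit.top1, 1)
print(f'top1 ≈ {intercept:.1f} + {slope:.2f} * log10(secs)')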

Commentary on findings

The LeViT family of models is both fast and accurate. These models are apparently constructed as a hybrid of convolutional neural networks and transformers.

The Swin family of transformers is among the most accurate. Its authors describe it as a “hierarchical Transformer whose representation is computed with shifted windows.”

Speed vs parameter count

We finally compare speed against parameter count. Papers often use parameter count as a proxy for speed. However, as we see, there is wide variation in speed at each level of parameter count, so it is not a reliable proxy. Parameter count sometimes correlates with memory requirements as well, but even that relationship is loose. In the following chart:
- the x axis shows the parameter count, on a log scale
- the y axis shows the speed in seconds per image, on a log scale.

px.scatter(df, width=w, height=h,
    x='param_count_x',  y='secs', log_x=True, log_y=True, color='infer_img_size',
    hover_name='model', hover_data=['infer_samples_per_sec', 'family']
)
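
To quantify how loose the parameter-count proxy is, a small sketch (hypothetical, not in the original analysis) can bucket models by parameter count and measure the spread of speeds within each bucket:

import numpy as np
import pandas as pd

# Bucket models by (log) parameter count; a large max/min speed ratio
# within a bucket means parameter count predicts speed poorly there
buckets = pd.cut(np.log10(df.param_count_x), bins=5)
spread = df.groupby(buckets).secs.agg(['min', 'max'])
print(spread.assign(ratio=spread['max'] / spread['min']))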

Training model performance

We’ll now replicate the above analysis for training performance. First we grab the data:

tdf = get_data('train', 'train_samples_per_sec')
tdf.head()
model train_samples_per_sec train_step_time train_batch_size train_img_size param_count_x top1 top1_err top5 top5_err param_count_y img_size crop_pct interpolation secs family
0 levit_128s 6303.14 80.293 512 224 7.78 76.530 23.470 92.866 7.134 7.78 224 0.900 bicubic 0.000159 levit
1 levit_128 4434.56 114.332 512 224 9.21 78.486 21.514 94.010 5.990 9.21 224 0.900 bicubic 0.000226 levit
3 levit_192 3823.94 132.765 512 224 10.95 79.842 20.158 94.786 5.214 10.95 224 0.900 bicubic 0.000262 levit
4 resnet18 3584.19 142.504 512 224 11.69 69.748 30.252 89.078 10.922 11.69 224 0.875 bilinear 0.000279 resnet
9 levit_256 2923.52 174.041 512 224 18.89 81.510 18.490 95.490 4.510 18.89 224 0.900 bicubic 0.000342 levit

Now we can repeat the same family plot we did above:

show_all(tdf, 'Training', 'train_img_size')

And we also look at a subset of models:

show_subs(tdf, 'Training', 'train_img_size')
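
Since both benchmark sets are now loaded, one can also relate them directly; a small sketch (a hypothetical extension, not part of the original analysis) joins them on model name to see how much more expensive a training step is than an inference step:

# Join inference and training benchmarks and compare seconds per sample
speed = df[['model', 'secs']].merge(tdf[['model', 'secs']],
                                    on='model', suffixes=('_infer', '_train'))
speed['train_over_infer'] = speed.secs_train / speed.secs_infer
print(speed.sort_values('train_over_infer', ascending=False).head())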