
A series of posts related to an art auction price model project.

Art Auction Data: Exploratory Data Analysis

I’m working towards an ML project that models painting prices in the secondary art market based on a variety of artwork features.

A robot art auctioneer, created by DALL-E 2

I’ve finished some of the heavy (and not-so-heavy) lifting on the front end–writing a script to scrape the data, and pre-processing and cleaning the scraped data, which contains 53,034 auction result records for some 141 artists.

Now it’s time to embark on some good, old-fashioned EDA so I can get a better sense of what we’re working with here and gain a little intuition about how realized auction price may or may not correlate with some of the features that I’ve either scraped or engineered.

The Dataset and its Features

Here’s a sample of what the dataset looks like:

data.sample(10, random_state=123)
artist_name title date medium dims auction_date auction_house auction_sale auction_lot price_realized ... auction_year price_realized_USD_constant_2022 area_cm_sq volume_cm_cu living years_after_death_of_auction artist_age_at_auction artist_age_at_artwork_completion artwork_age_at_auction years_ago_of_auction
31937 Salvador Dali The persistence of memory ,\nThe persistence o... NaN tapestry 162.56 x 140.97 cm Aug 12, 2017 Michaan's Auctions • Alameda August Estate Auction Lot420 US$750 ... 2017 8.954441e+02 22916.083200 NaN 0 28.0 NaN NaN NaN 6
7514 Gerhard Richter ABSTRAKTES BILD 802-3 NaN oil on canvas 112 by 102 cm Sep 30, 2018 Sotheby's NaN Lot1079 HK$27,720,000 • US$3,540,939 ... 2018 4.126820e+06 11424.000000 NaN 1 NaN 86.0 NaN NaN 5
18591 Ed Ruscha Regal 2001 dry pigment and acrylic on museum board 101.9 x 152.4 cm May 16, 2019 Christie's • New York Post-War and Contemporary Art Morning Session Lot635 US$927,000 ... 2019 1.061153e+06 15529.560000 NaN 1 NaN 82.0 64.0 18.0 4
39383 Takashi Murakami Superflat Monogram 2003 acrylic on canvas mounted on board 178.31 x 178.31 in May 15, 2008 Phillips de Pury & Company• New York Contemporary Art Part I Lot110 US$724,200 ... 2008 9.843836e+05 205125.112975 NaN 1 NaN 46.0 41.0 5.0 15
35906 Bernard Buffet L'atelier 1949 oil on canvas 66.04 x 91.44 in May 1, 1996 Christie's • New York Impressionist & Modern Paintings, Drawings & S... Lot246 US$38,000 ... 1996 7.087884e+04 38959.261436 NaN 1 NaN 68.0 21.0 47.0 27
21611 Chu Teh-Chun Mirage èa l'aube (Mirage at Dawn) 1920-2014 oil on canvas 130 x 195 cm May 30, 2015 Christie's • Hong Kong, HKCEC Grand Hall Asian 20th Century & Contemporary Art (Evening... Lot60 HK$15,640,000 • US$2,026,991 ... 2015 2.502812e+06 25350.000000 NaN 0 1.0 NaN NaN NaN 8
25757 Keith Haring Untitled 1983 Oil on panel 88.9 x 200.03 in Dec 8, 1998 Binoche • Paris Binoche Lot42 NaN ... 1998 NaN 114726.654417 NaN 0 8.0 NaN 25.0 15.0 25
3388 Andy Warhol Key Service (Positive) 1986 synthetic polymer and silkscreen ink on canvas 50.8 x 40.6 cm Nov 12, 2012 Christie's Andy Warhol at Christie's Sold to Benefit the ... Lot292 US$62,500 ... 2012 7.966644e+04 2062.480000 NaN 0 25.0 NaN 58.0 26.0 11
20841 Pierre-Auguste Renoir Paysage aux collettes NaN oil on canvas 45.72 x 30.48 in May 28, 1997 Bukowskis• Stockholm International Auction Lot303 SEK695,000 • US$90,541 ... 1997 1.650921e+05 8990.598793 NaN 0 78.0 NaN NaN NaN 26
48705 Julian Schnabel Untitled 1981 ink, sand and gesso on torn paper 96.52 x 124.46 in Oct 14, 1998 Christie's • Los Angeles 20th Century & Contemporary Art Lot100 US$17,000 ... 1998 3.052230e+04 77502.291447 NaN 1 NaN 47.0 30.0 17.0 25

10 rows × 45 columns

And here’s some background on the features it includes:

Scraped Features

Artwork Information

  • artist_name: The artist’s name, as it appears in the original list of artist names input into the scraping script

  • title: The artwork’s title
  • medium: The artwork’s medium
  • date: The artwork’s attributed date, which in some cases is a span (e.g., 1956-1958) or an estimate (e.g., 1920s, circa 1940s-1950s, 16th century, etc.)
  • dims: The artwork’s attributed dimensions. Because these are paintings, in most cases there are two measurements (width and height), but in some cases objects include a depth measurement or, for circular works, a diameter measurement.

Auction Information

  • auction_date: Date of auction, in Month DD, YYYY format
  • auction_house: Name of auction house (e.g., Sotheby’s)
  • auction_sale: Name of sale (e.g., Contemporary Evening Sale)
  • auction_lot: Number of auction lot
  • price_realized: Realized price in nominal currency. Includes transaction currency and, if not USD, conversion to USD
  • estimate: Range of auction house estimate for the work.
  • bought_in: Whether or not work was bought in (i.e., artwork goes unsold)

Merged Features

The following features are merged from the Museum of Modern Art’s collection dataset:

  • Nationality: The artist’s nationality
  • Gender: The artist’s gender
  • birth_year: Year of the artist’s birth
  • death_year: Year of the artist’s death (when applicable)

Parsed Features


  • auction_date_parsed: Conversion of date field to DateTime object
  • start_date: Year in which artwork was begun (identical to end_date in cases where date is a single year)
  • end_date: Year in which artwork was completed (identical to start_date in cases where date is a single year)


  • dims_cm: Extraction from dims of measurements denominated in cm
  • dims_mm: Extraction from dims of measurements denominated in mm
  • dims_in: Extraction from dims of measurements denominated in in
  • is_diameter: Boolean for whether a given measurement is indicated to be a diameter
  • width_cm: Width measurement extracted from dims_cm or computed from dims_mm or dims_in
  • height_cm: Height measurement extracted from dims_cm or computed from dims_mm or dims_in
  • depth_cm: Depth measurement extracted from dims_cm or computed from dims_mm or dims_in
  • width_mm: Width measurement extracted from dims_mm
  • height_mm: Height measurement extracted from dims_mm
  • depth_mm: Depth measurement extracted from dims_mm
  • width_in: Width measurement extracted from dims_in
  • height_in: Height measurement extracted from dims_in
  • depth_in: Depth measurement extracted from dims_in

Auction Information

  • auction_house_loc: Location of auction (when applicable), as extracted from auction_house
  • auction_house_name: Name of auction house, extracted from auction_house
  • price_realized_USD: Nominal USD realized price, extracted from price_realized
  • auction_year: Year, reformatted from auction_date

Engineered Features

Auction Information

  • price_realized_USD_constant_2022: Conversion of price_realized_USD to constant 2022 dollars using the cpi library


  • area_cm_sq: Artwork size as surface area, computed from width_cm and height_cm (or from width_cm alone if it is a diameter measurement)
  • volume_cm_cu: Artwork size as volume for three-dimensional works, computed from width_cm, height_cm, and depth_cm


  • living: Boolean for whether an artist was living at the time of auction
  • years_after_death_of_auction: Number of years after artist’s death that the auction occurred (in cases when the artist was no longer alive at the time of auction)
  • artist_age_at_auction: Artist’s age at the time of auction (in cases where artist was living at the time of auction)
  • artist_age_at_artwork_completion: Artist’s age at the time of artwork’s completion. Proxy for stage of artist’s career.
  • artwork_age_at_auction: Age of artwork in years at time of auction
  • years_ago_of_auction: Years elapsed from auction to present
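As a rough illustration of how the engineered features above could be derived from the parsed ones, here is a minimal sketch. It is not the code used for this project. The first two rows reuse values from the sample above (the Warhol and Ruscha lots), with the artists' birth and death years filled in by hand; the third row is a hypothetical circular work. Treating width_cm as a diameter for round works, and counting an artist as living if they died in the auction year itself, are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Rows 0-1 mirror the Warhol and Ruscha rows in the sample; row 2 is hypothetical.
toy = pd.DataFrame({
    "width_cm": [50.8, 101.9, 100.0],
    "height_cm": [40.6, 152.4, np.nan],
    "is_diameter": [False, False, True],
    "birth_year": [1928, 1937, 1900],
    "death_year": [1987, np.nan, 1950],
    "end_date": [1986, 2001, 1940],
    "auction_year": [2012, 2019, 2000],
})

# Surface area: width x height for rectangles; for round works, width_cm is
# assumed to hold the diameter, so the area is pi * (d / 2) ** 2.
rect_area = toy["width_cm"] * toy["height_cm"]
circle_area = np.pi * (toy["width_cm"] / 2) ** 2
toy["area_cm_sq"] = np.where(toy["is_diameter"], circle_area, rect_area)

# Living flag (assumption): no recorded death year, or death in/after the auction year.
alive = toy["death_year"].isna() | (toy["death_year"] >= toy["auction_year"])
toy["living"] = alive.astype(int)

# Age-style features; each is left as NaN where it does not apply.
toy["years_after_death_of_auction"] = (toy["auction_year"] - toy["death_year"]).where(~alive)
toy["artist_age_at_auction"] = (toy["auction_year"] - toy["birth_year"]).where(alive)
toy["artist_age_at_artwork_completion"] = toy["end_date"] - toy["birth_year"]
toy["artwork_age_at_auction"] = toy["auction_year"] - toy["end_date"]

print(toy.iloc[:, 7:])
```

For the two rows taken from the sample, the derived values line up with the ones shown in the table (for example, an area of 2062.48 cm² and 25 years after death for the Warhol lot).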

Note that I won’t be working with most of the raw, scraped features.

A Note on Methodology: Constant vs. Nominal Dollars

In most of what follows, I’ve decided to do preliminary data analysis for patterns and trends using constant 2022 dollars rather than nominal dollars from each observation’s given auction year. My reason for doing this is to eliminate the inflation variable as much as possible so that we can attempt to measure realized price according to a single standard. Otherwise, any attempt to look for correlations between a certain variable and price realized would be confounded by auction date. For instance, consider an artwork sold in 1989 for a relatively high price and an artwork sold in 2020 for a relatively low price: due to inflation, these two prices might be the same in nominal terms, and we will have lost the ability to see the difference in their real value. We want to eliminate this possibility to the extent that we can.

Not always, though. Ultimately I do want my model to predict prices in nominal amounts–that is, I want the model to predict the price for a work sold in 1990 in nominal 1990 dollars. But again, my sense is that I’ll have an easier time understanding general trends and patterns in the data if I adjust for inflation. As a result, I’ll use the price_realized_USD_constant_2022 feature that I engineered so that I’m dealing with constant 2022 USD amounts.
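To make the constant-dollar idea concrete, here is a minimal sketch of the arithmetic behind an inflation adjustment: a nominal price is scaled by the ratio of the base-year price index to the sale-year price index. The actual conversion in this project was done with the cpi library; the helper name `to_constant_dollars` and the index values below are hypothetical placeholders chosen to keep the numbers easy to follow, not real CPI figures.

```python
import pandas as pd

# Hypothetical price-index values for illustration only (NOT real CPI data).
cpi_index = {1989: 100.0, 2020: 200.0, 2022: 250.0}

def to_constant_dollars(nominal, year, base_year=2022, index=cpi_index):
    """Scale a nominal amount from `year` into `base_year` dollars."""
    return nominal * index[base_year] / index[year]

# Two sales with the same nominal price, 31 years apart.
sales = pd.DataFrame({
    "auction_year": [1989, 2020],
    "price_realized_USD": [100_000.0, 100_000.0],
})
sales["price_constant"] = [
    to_constant_dollars(price, year)
    for price, year in zip(sales["price_realized_USD"], sales["auction_year"])
]
print(sales)
```

With these placeholder index values, the two identical nominal prices come out to 250,000 and 125,000 in base-year dollars, which is exactly the kind of difference that nominal prices would hide.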


1. Price Correlates Strongly with Artist Name

Based on some limited domain experience, my first intuition is that artist name will be the single most important factor in determining price. Which makes sense: Warhol will fall into one price bracket, while new MFAs will fall into another price bracket.

Let’s take a look at the artists most represented in this dataset by auction count, and then compare their realized price distributions (using constant 2022 dollars):

Price correlation with artist name

The first thing to note here is that auction results are evidently not distributed normally–not by a long shot. The realized price distributions for all the artists here, Warhol and Picasso in particular, have an aggressive positive skew, with a huge number of outliers–including (I was shocked to discover) a Warhol work that sold for close to $200M.

Let’s check again without the fliers.

Price correlation with artist name, no fliers

Once we get rid of the outliers, we can see more clearly just how much variance there is from one artist market to the next.
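The numbers behind a per-artist box plot can also be tabulated directly. The sketch below is an assumption about one way to do it, not the plotting code used here: it keeps the most-represented artists by auction count, computes quartiles per artist, and adds the conventional upper fence (Q3 + 1.5 × IQR) beyond which box plots draw points as fliers. The function name `artist_price_summary` and the toy prices are hypothetical.

```python
import pandas as pd

def artist_price_summary(df, price_col="price_realized_USD_constant_2022", top_n=10):
    """Quartiles and upper flier fence for the top_n artists by auction count."""
    top_artists = df["artist_name"].value_counts().nlargest(top_n).index
    subset = df[df["artist_name"].isin(top_artists)]
    q = subset.groupby("artist_name")[price_col].quantile([0.25, 0.5, 0.75]).unstack()
    q.columns = ["q1", "median", "q3"]
    # Points above this fence are what a box plot would show as fliers.
    q["upper_fence"] = q["q3"] + 1.5 * (q["q3"] - q["q1"])
    return q.sort_values("median", ascending=False)

# Hypothetical toy data: artist A has one extreme sale, artist B does not.
toy = pd.DataFrame({
    "artist_name": ["A"] * 5 + ["B"] * 3,
    "price_realized_USD_constant_2022": [1.0, 2.0, 3.0, 4.0, 100.0, 10.0, 20.0, 30.0],
})
summary = artist_price_summary(toy, top_n=2)
print(summary)
```

In the toy data, artist A's sale at 100 sits far above its upper fence of 7, so it would be hidden once fliers are turned off, while the medians still show B's market sitting well above A's.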

2. Individual Artist Markets Vary

Another way we can examine this question of individual artist markets is to look at whether the correlations between realized price and certain features–painting size, for instance, or painting age at the time of auction–behave differently according to artist. In other words, perhaps for some artists size correlates strongly with realized price while for others it may not. Again, in order to resolve the inflation issue (since we’re looking at auction results from a nearly 40-year period), I’ll use constant 2022 dollars as the target variable.

Heatmap of price correlation with artist name

Interestingly, we can see that, for an artist like Damien Hirst, size (width in particular) correlates relatively strongly with realized price, while for an artist like Bernard Buffet or Sam Francis, the correlation is much less pronounced. We can also see that for an artist like Zao Wou-Ki, realized price increases with the artist’s age, whereas for Francis or Buffet, the opposite is true.

While the artist’s name can of course be included in the model as a feature, to keep things simple for starters my approach will be to try to model an individual artist first. Intuitively this feels especially important since some features are correlated positively for certain artists and negatively for others.
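A per-artist correlation table like the one behind the heatmap can be sketched as follows. This is a minimal example under assumptions: the helper name `per_artist_correlations` is hypothetical, the toy values are made up, and Pearson correlation on untransformed values is used because the post does not say which variant the heatmap shows.

```python
import pandas as pd

def per_artist_correlations(df, features, target):
    """Pearson correlation of each feature with the target, one row per artist."""
    rows = {
        artist: group[features].corrwith(group[target])
        for artist, group in df.groupby("artist_name")
    }
    return pd.DataFrame(rows).T

# Hypothetical toy data: price rises with width for artist A and falls for artist B.
toy = pd.DataFrame({
    "artist_name": ["A"] * 4 + ["B"] * 4,
    "width_cm": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "price_realized_USD_constant_2022": [10.0, 20.0, 30.0, 40.0, 40.0, 30.0, 20.0, 10.0],
})
corr = per_artist_correlations(toy, ["width_cm"], "price_realized_USD_constant_2022")
print(corr)
```

Pooling both toy artists into one correlation would give roughly zero, which is the motivation for modeling artists individually when a feature pulls in opposite directions across markets.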

3. Prices are Logarithmic

Because the realized price for artworks has such an aggressively positive skew, it turns out that taking the log of realized price brings the distribution much closer to normal.

Prices on a log scale
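One quick way to check the effect of a log transform without a plot is to compare sample skewness before and after. The sketch below uses synthetic, log-normally distributed "prices" (an assumption standing in for the real data, with arbitrary parameters) purely to show the mechanics.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for realized prices: log-normal draws, NOT the auction data.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=11.0, sigma=1.5, size=10_000))

raw_skew = prices.skew()            # strongly positive for a long right tail
log_skew = np.log10(prices).skew()  # close to zero if log-prices are near-normal

print(f"skew of raw prices:   {raw_skew:.2f}")
print(f"skew of log10 prices: {log_skew:.2f}")
```

On real data, the same two calls could be run on price_realized_USD_constant_2022 (after dropping missing and non-positive values) to see how much of the skew the log removes.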

4. Artwork Size is Logarithmic, Too

Artwork size (width, height, and area) has a similar positive skew which can be remedied with a logarithmic scale.

Artwork Size

Compare that with the same distributions on logarithmic scales.

Artwork Size on log scale

5. Artwork Size and Price Have a Moderate Positive Correlation

Knowing that artwork dimensions and price need to be plotted on logarithmic scales, let’s see if there’s any meaningful correlation between the two.

Artwork size vs price

Generally, yes, it appears there is some positive correlation between size and price realized.
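To put a number on a relationship like this, one option is to compute the correlation and a straight-line fit on the log-transformed values. The sketch below does this on synthetic data generated with a known log-log slope of 0.5 plus noise; the data, the slope, and the noise level are all assumptions for illustration and say nothing about the actual auction results.

```python
import numpy as np

# Synthetic log10(area) and log10(price) with a built-in slope of 0.5 (illustrative only).
rng = np.random.default_rng(1)
log_area = rng.uniform(2.0, 5.0, size=2_000)  # areas from 1e2 to 1e5 cm^2
log_price = 3.0 + 0.5 * log_area + rng.normal(0.0, 0.5, size=2_000)

# Pearson correlation and least-squares line in log-log space.
r = np.corrcoef(log_area, log_price)[0, 1]
slope, intercept = np.polyfit(log_area, log_price, deg=1)

print(f"correlation of logs: {r:.2f}")
print(f"fitted slope:        {slope:.2f}")
```

A slope of 0.5 in log-log space would mean that a tenfold increase in area goes with roughly a threefold (10^0.5) increase in price, which is one way to read a moderate positive correlation between size and price.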

6. Realized Price Varies by Artist Nationality

How does realized price vary with artist nationality?

Price vs nationality

There do seem to be some differences here, but because I intend to create models for each artist, this feature won’t really matter in the end. But still interesting to see!

7. Realized Price Doesn’t Vary Much by Gender

How does realized price vary by gender? First let’s check to see how many women artists this dataset contains:

# Count the number of distinct artists for each gender
cols=['Gender', 'artist_name']
data.loc[~data.duplicated(subset=cols), cols]['Gender'].value_counts()
Male      129
Female     12
Name: Gender, dtype: int64

Not a huge sample, unfortunately, but let’s see.

Price vs gender

There is some difference here, but the median realized price is quite close for men and women. Prices for male artists, however, have much more variability as the lower chart shows.

Like Nationality, this feature won’t really come into play since I’ll be making artist-specific models.

8. Realized Price Varies by Artist’s Generation

How does realized price vary by artist generation? To find out, I’ll group artists into decades by their birth year. Artists born prior to 1800, of which there are a couple in this dataset, I’ll lump into a ‘pre-1800’ category.
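The decade grouping described here could be implemented along these lines. This is a minimal sketch: the function name `birth_generation` and the label format (e.g., "1850s") are assumptions, and missing birth years are simply passed through as missing.

```python
import pandas as pd

def birth_generation(birth_year):
    """Map a birth year to a decade label, lumping everything before 1800 together."""
    if pd.isna(birth_year):
        return None
    year = int(birth_year)
    if year < 1800:
        return "pre-1800"
    return f"{year // 10 * 10}s"

# Example birth years, including one pre-1800 artist and one missing value.
birth_years = pd.Series([1853, 1928, 1775, None])
generations = birth_years.map(birth_generation)
print(generations.tolist())
```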

Price vs artist generation

I’m not sure how useful this information is, since the differences we see can easily be attributed to the artists and the specifics of their markets. For instance, it turns out that, in this dataset, there’s only one artist who was born in the 1850s, and that’s Van Gogh, who evidently fetches consistently high prices. But as with Gender and Nationality, this feature won’t be a concern of mine when building artist-specific models.

9. Realized Price and Artwork Date are Negatively Correlated (Older Works Sell for More)

How does price correlate with an artwork’s completion date? Are certain periods of art production more valuable than others? There are a few works in this dataset from prior to 1800–I’ll do without those so we can focus on work made from ~1850 to present, which is where the bulk of our data is.

Price vs date

Here we see a slight negative correlation between artwork year and price, indicating a value premium put on older works vs. newer ones–makes sense.

10. Realized Price and Auction Year are Positively Correlated (Artist Markets Accrue in Value)

How does price correlate with auction year? Because we’re using constant 2022 dollars, any changes we see should be a function not of inflation but of value increasing over time.

Price vs Auction year

As expected, here we can see a slight positive correlation between auction year and realized price, suggesting, again, that artist values are increasing over time in aggregate.

11. Realized Price Varies by Auction House

Do different auction houses correlate with different price ranges? To look into this, I’ll reduce the cardinality of the auction_house_name feature so that we’re looking at the main players and a catch-all category for everyone else.

Price vs auction house

There are clear differences here, it seems, so the auction house seems like it will be a valuable predictor of price. But I’ll want to reduce the cardinality, as I have above, for each individual artist market, since not all artists will have this same proportion of auction house representation.
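The cardinality reduction mentioned in this section can be sketched as keeping the most frequent categories and relabeling the rest. The helper name `lump_rare_categories`, the cutoff of two houses, and the "Other" label are assumptions for the example; note that missing values also end up in the catch-all bucket with this approach.

```python
import pandas as pd

def lump_rare_categories(values, top_n=2, other_label="Other"):
    """Keep the top_n most frequent categories; relabel everything else."""
    top = values.value_counts().nlargest(top_n).index
    return values.where(values.isin(top), other_label)

# Toy auction_house_name column: two frequent houses and two rare ones.
houses = pd.Series(
    ["Christie's"] * 3 + ["Sotheby's"] * 2 + ["Phillips", "Bukowskis"]
)
lumped = lump_rare_categories(houses, top_n=2)
print(lumped.value_counts())
```

Because the function only looks at the values it is given, it could be run separately on each artist's subset of the data so that the "main players" are chosen per artist market rather than globally.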

12. Realized Price Varies by Auction Location

We have some data for auction location in this dataset. Let’s see if that has any bearing on price.

Price vs Location

Here, too, we can see important trends, since certain locations correlate with higher or lower prices.

13. Dead Artists Fetch Higher Prices than Living Artists (but it’s complicated)

How does whether or not an artist is living at the time of auction affect the price of a work? It feels rather obvious to me that prices will go up after an artist is no longer living–not only because there is no more work being created, but also because this implies that the artwork itself is older, which we’ve seen correlates positively with price.

Price vs Living/Deceased

No surprises here.

I am curious, though, if there are trends when we examine prices as a function of how many years before or after an artist’s death the auction took place:

Price vs Living/Deceased, Line chart

This is interesting, since it helps us see that median price does rise during an artist’s lifetime. Unexpectedly, there is a precipitous drop in realized price immediately following an artist’s death–my suspicion is that collectors aren’t selling much then and, if they are, not major works. Within about 25 years, though, median prices have recovered, and they continue to rise.
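A median-price curve like this one can be approximated by measuring each auction's distance in years from the artist's death (negative before, positive after), binning those distances, and taking the median price per bin. The sketch below is an assumption about the approach, with a hypothetical function name, a 5-year bin width, and made-up prices; it only covers artists with a known death year.

```python
import pandas as pd

def median_price_by_years_from_death(df, price_col, bin_width=5):
    """Median price per bin of (auction_year - death_year); negative bins are pre-death."""
    known = df.dropna(subset=["death_year", "auction_year", price_col]).copy()
    rel = known["auction_year"] - known["death_year"]
    # Floor to the start of each bin, e.g. -3 -> -5 and 3 -> 0 for bin_width=5.
    known["years_from_death_bin"] = (rel // bin_width * bin_width).astype(int)
    return known.groupby("years_from_death_bin")[price_col].median()

# Hypothetical toy data for a single artist who died in 1987.
toy = pd.DataFrame({
    "death_year": [1987.0] * 5,
    "auction_year": [1984, 1985, 1988, 1990, 1995],
    "price": [100.0, 200.0, 50.0, 70.0, 300.0],
})
curve = median_price_by_years_from_death(toy, "price")
print(curve)
```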

Here’s another way of considering this:

Price vs Living/Deceased, Scatter

What’s interesting to note here is that prices generally seem to rise more quickly over the course of an artist’s lifetime than they do after his/her death.

14. Artist Age at Auction and Price Realized are Positively Correlated

Price vs artist age

No surprises here. As an artist ages, auction prices go up, which makes sense: the artist’s legacy is that much more secure and, as we’ve already seen, his/her oeuvre accrues value over time independent of inflation.

15. Realized Price and Artist Age at Artwork Completion are Mostly Uncorrelated

What about how an artist’s age at the time a given artwork was completed correlates with realized price?

Price vs Artist Age at Artwork Completion

I don’t see any meaningful correlation here really, but my intuition is that this may be correlated, negatively or positively, for different artists where the market favors, for instance, early career work or late career work, etc.

16. Realized Price and Artwork Age at Auction are Positively Correlated

Price vs artwork age

And this, too, looks like what we’d expect: Older artworks fetch higher prices.

