So much of what determines an artwork’s value (and its monetary value, too) has to do with features that are not readily quantifiable. Things like an artwork’s historical significance or significance within an artist’s oeuvre. The prestige of its provenance or exhibition history. Its renown. The extent to which it may be typical of, or else deviate from, an artist’s dominant tendencies, in general or during a specific period, either of which may make it more or less desirable. Imperfections that may likewise either augment its scarcity or else invalidate it. Its beauty, if that’s your thing. Things, in short, that belong to the domain of connoisseurship and result from deep intuition about art history and the small and exclusive world of art collectors.

Robo auctioneer, generated by DALL-E 2

And then, even if you do have the data, artists’ oeuvres and individual artworks are so specific that it’s hard to imagine there’s a huge sample to draw from. For instance, if a given artwork is unusual within an artist’s oeuvre, and if similar works haven’t come up for auction much, my sense is that it will be difficult to arrive at an expected value, even if the artist’s auction history is otherwise quite robust.

Because of this, I’ve often wondered whether auction prices can be effectively modeled with information that can be quantified. Things like size, date of production, the artist’s age when she completed the work, its medium, keywords in its title. So I set out to discover if this is the case.

Scraping the Data

Auction data is public, but still tricky to get en masse, which is why I started by building a Python script to scrape the data (in full accordance with the source site’s robots.txt file, naturally). My automated spelunking expedition yielded some 53,034 auction records from 1986 to 2023, involving the work of 141 artists. Of these, 40,798 auction records resulted in valid sales.
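The scraper itself followed the usual fetch-and-parse pattern. Here is a minimal sketch of that approach using `requests` and BeautifulSoup; the URL structure, CSS selectors, and field names are placeholders, not the actual site's markup:

```python
import time
import requests
from bs4 import BeautifulSoup

def fetch_auction_page(url: str, delay: float = 1.0) -> BeautifulSoup:
    """Fetch one results page, pausing between requests to stay polite."""
    time.sleep(delay)  # throttle so we don't hammer the server
    resp = requests.get(url, headers={"User-Agent": "auction-research-bot"})
    resp.raise_for_status()
    return BeautifulSoup(resp.text, "html.parser")

def parse_lots(soup: BeautifulSoup) -> list[dict]:
    """Extract one record per lot; the selectors here are hypothetical."""
    records = []
    for lot in soup.select("div.lot"):  # hypothetical CSS class
        records.append({
            "title": lot.select_one(".title").get_text(strip=True),
            "price": lot.select_one(".price-realized").get_text(strip=True),
        })
    return records
```

In practice each page of results gets fed through `parse_lots`, and the accumulated dicts become rows in a DataFrame.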

I wasn’t scraping just any records, of course. To keep this project at least somewhat manageable to start, I limited my scope to works listed as paintings and to artists considered blue chip, which I thought might mean less variability.

Processing, Parsing, and Wrangling the Data

The auction records I scraped came back with the following features:

  • artist name
  • artwork title
  • artwork date
  • artwork medium
  • artwork dimensions
  • auction date
  • auction house
  • auction sale name
  • auction lot number
  • price realized
  • price estimate
  • whether or not the work was bought in (i.e., did not sell)

This all sounds well and good, but pre-processing and especially cleaning the data was a substantial undertaking that involved the following steps:

  • standardizing artist name formatting
  • merging with the Museum of Modern Art’s collection data for additional features: artist gender, nationality, and years of birth/death
  • manually adding gender, nationality, and birth/death data for scraped artists not in MoMA’s collection
  • parsing auction date as DateTime object
  • cleaning and parsing artwork dates, which came in a disturbing variety of formats, and extracting start and end dates, since artworks are sometimes dated to a range (e.g., 1956-57) or an approximate period (e.g., circa 1923, circa 1960s, 19th century)
  • resolving any weirdness with dates (e.g., an artwork’s completion date being after the auction date)
  • parsing artwork dimensions. This was by far the most varied feature in terms of format and units of measurement. My strategy was to use the unit of measurement (cm, mm, or in) as a regex token and look for the groups of numbers preceding it. Usually we’d expect two such groups (width and height), but often there was just one (in which case the field would usually specify that we were dealing with a diameter measurement) or three (in which case depth was, for whatever reason, included as a significant measurement), so my parsing of this feature had to account for these cases as well. I decided to adopt cm as my standard unit moving forward, which meant converting in cases where I only had inches or millimeters
  • parsing, extracting, and standardizing auction house and auction location information
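The dimension-parsing strategy described above can be sketched roughly like this. The regex and function below are an illustration of the technique (unit token as anchor, number groups before it, conversion to cm), not the exact pattern I used:

```python
import re

# Numbers separated by "x" (or "×"), followed by a unit token: cm, mm, or in.
# Handles e.g. "30 x 40 cm", "61 × 45.7 × 5.1 cm", "12 in (diameter)".
DIM_RE = re.compile(r"((?:\d+(?:\.\d+)?\s*[x×]\s*)*\d+(?:\.\d+)?)\s*(cm|mm|in)\b")

TO_CM = {"cm": 1.0, "mm": 0.1, "in": 2.54}

def parse_dims_cm(text: str) -> list[float]:
    """Return measurements in centimeters: usually [width, height],
    sometimes a single diameter, sometimes [width, height, depth]."""
    m = DIM_RE.search(text.lower())
    if not m:
        return []
    numbers, unit = m.groups()
    values = [float(v) for v in re.split(r"\s*[x×]\s*", numbers)]
    return [round(v * TO_CM[unit], 2) for v in values]
```

The key design choice is searching for the unit first, so the one-, two-, and three-measurement cases all fall out of the same pattern.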

Engineering New Features

In the process of parsing and wrangling the scraped data, I engineered some features that seemed obvious or potentially useful. Many of these have to do with normalizing the auction event or artwork date with respect to the artist’s life so I don’t have to rely on year-like features.

  • a boolean for whether or not the artist was living at the time of auction. I’m not sure to what degree it’s true, but I’ve heard that works by artists who are no longer living fetch higher prices, which makes sense intuitively for a few reasons: first, the artist’s legacy is that much more established or settled, and second, the artist’s oeuvre is now finite in the absence of new work being made.
  • years after death of auction: For artists who were no longer living at the time of auction, I imagine it may be useful to know how long after that artist’s death the auction took place.
  • artist age at auction: For artists who were living at the time of auction, how old were they?
  • artist age at artwork completion: This is an interesting one for me. In the same way that certain periods of an artist’s work are more or less historically significant, my instinct is that certain periods are more or less desirable, as well, from a collector’s standpoint. My sense is that this is a value that could usefully be binned depending on the artist, so we end up with something like a proxy for early-, mid-, and late-career work.
  • artwork age at auction: How many years old was the artwork at the time of auction?
  • adjusting currency to constant 2022 dollars: Ultimately I want the model to predict realized price in nominal terms, but for the purposes of EDA I wanted to remove the inflation variable as much as possible.
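Most of the date-normalizing features above reduce to simple column arithmetic in pandas. A toy sketch (the column names are illustrative, not necessarily the ones in my dataset):

```python
import pandas as pd

# Toy frame; the real data comes from the scrape plus the MoMA merge.
df = pd.DataFrame({
    "auction_year": [1995, 2010],
    "artwork_year": [1964, 1980],
    "birth_year":   [1928, 1928],
    "death_year":   [1987, 1987],  # NaN for living artists
})

# Living at auction: no death year recorded, or the auction predates it.
df["artist_living_at_auction"] = (
    df["death_year"].isna() | (df["auction_year"] < df["death_year"])
)
# Years elapsed since death (floored at 0 for auctions during the artist's life).
df["years_after_death"] = (df["auction_year"] - df["death_year"]).clip(lower=0)
df["artist_age_at_auction"] = df["auction_year"] - df["birth_year"]
df["artist_age_at_completion"] = df["artwork_year"] - df["birth_year"]
df["artwork_age_at_auction"] = df["auction_year"] - df["artwork_year"]
```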

Natural Language Processing

The artwork title and medium features are a mess, so some rudimentary Natural Language Processing will help turn that messiness into features. With the medium field, it offers a technique for categorizing works by frequently occurring keywords. The same goes for the title field: perhaps collectors prefer paintings with “portrait” in the title, or perhaps “portrait”-titled works are associated with a particularly significant part of the artist’s career.
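At its simplest, this keyword approach is just tokenizing, counting, and thresholding. A minimal sketch (the titles and stopword list are made up for illustration):

```python
import re
from collections import Counter

titles = [
    "Portrait of a Lady", "Self-Portrait", "Untitled",
    "Flowers", "Portrait of the Artist", "Untitled (Flowers)",
]

STOPWORDS = {"of", "a", "the", "and"}

def keyword_counts(texts: list[str]) -> Counter:
    """Count lowercase word tokens across all strings, minus stopwords."""
    counts: Counter = Counter()
    for t in texts:
        counts.update(
            w for w in re.findall(r"[a-z]+", t.lower()) if w not in STOPWORDS
        )
    return counts

# Once a keyword clears some frequency threshold, it becomes a boolean feature:
has_portrait = ["portrait" in t.lower() for t in titles]
```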

EDA: General

The first step was some good, old-fashioned EDA. Even though I anticipate making artist-specific models, looking through the entire dataset might help justify that intuition as well as point the way to some overarching trends.

EDA: Andy Warhol Case Study

Next I wanted to look at a specific artist. I chose Warhol in spite of some of the major outliers in his auction record (including a 1964 Marilyn Monroe screenprint that sold for *gulp* $195 million in 2022, setting a record for the most expensive American artwork ever sold). He is very well represented in this dataset, his titles are descriptive, and his subject matter tends to be limited to certain moments in time. In Part I of this case study, I look at all the features in the dataset except for title and medium. In Part II (forthcoming) I look at the NLP features I’ve generated.


In progress.

Yet More Feature Engineering: Natural Language Processing

In progress.

For more on the state of this project, see my blog entries devoted to the topic.


Python Pandas Seaborn Matplotlib