4 minute read

A series of posts related to an art auction price model project.

Art Auction Data: Exploratory Data Analysis, Andy Warhol Case Study

Now that I have a better sense of how the various features in this dataset relate to realized price, I want to look at a specific artist, since my sense is that individual artist markets obey their own unique logics. (In fact, I’d wager that this is true at an even more granular level of specific bodies of work, but I digress.)

Warhol presenting charts and graphy
Warhol (okay, not quite, but good enough) presenting charts and graphs, created by Midjourney

Introductory Notes

Much of this analysis follows a similar trajectory as the more general EDA I’ve done for this dataset, except this time with an eye towards the specific logic of Andy Warhol’s painting market for the past 30-40 years. As with the general EDA for this project, I have opted to use constant 2022 dollars rather than nominal dollars from each observation’s given auction year in order to eliminate the inflation variable from analysis to the extent that this is possible. This means that price differences we see in similar artworks should be a function of Warhol’s painting market accruing value.

1. Warhol’s Auction Price Distribution is Positively Skewed but Log-Normal

As with the more general dataset, Warhol’s auction prices show an aggressively positive skew but are log normal.

Warhol's auction price dist is positively skews

As we can see, the distribution has an aggressive positive skew, with one sale approaching \$200M–much higher than the median sale price, which looks closer to \$300k or so.

Let’s view the distribution as a histogram.

Hist of Warhol's pos skewed dist

Log-normalized prices for warhol

Seeing that Warhol’s auction prices are log normal suggests that our model will more effectively handle outliers if it predicts the log of auction prices.

2. Warhol’s Painting Dimensions are Also Log-Normal

Warhol's painting dimensions dist.

Now let’s see if taking the log of these dimensions helps us see the distribution better.

Log-normalized painting dims

Looking at this, my instinct is that training the model not on the raw dimensions but on their log will help generate more accurate predictions.

3. Auction Price and Painting Size are Positively Correlated for Warhol

What was true of the overall dataset is true in Warhol’s case, too.

Price vs dims

4. ICYMI: Warhol’s Value is Increasing Over Time

Since we’ve already adjusted for inflation, any change we see should be due to Warhol’s value increasing over time, not the incidental effects of inflation.

Price over time

5. Prices Vary by Auction House

As with the larger dataset, prices for Warhol’s paintings vary by auction house.

Price vs auction house

6. Prices Vary by Auction Location

Same is true of auction location.

Price vs auction location

7. Prices Vary by Auction Decade

This makes sense: Artist markets (like anything, really) go through cycles. In this case, it looks like the 2010s saw especially high sales, which could be a sign of more demand and/or less supply, or just a function of what works came up for auction at a given time

Price vs auction decade

What if we look at a more granular level, say ever 5-year window?

Price vs auction year

8. Prices Vary by Artwork Date

This makes perfect sense–i.e., that certain periods of Warhol’s career will be more or less valuable than others. I expect this is a function of supply (works from certain periods may be more or less scarce) and historical significance (works from certain periods may be deemed more or less significant), among other things.

Price vs. Artwork date

The artwork’s date poses some interesting problems when conceiving it as a feature for the entire dataset. For an artist like Warhol, the 1960s may be a hugely important time. Whereas for an artist who passed away in the 1950s, this same period would be meaningless.

One solution is to date artworks not by year but relative to an artist’s age. This would allow us to compare two works made at comparable periods in their artist’s life even if one was completed in 1920 and the other in 1965.

This solution would help normalize artwork dates, but it is still limited, since it does not account for the fact that some artists created their more important work early in their career and others later in their career.

Warhol is actually an interesting example here, since he broke into the artwork relatively late in life, which many may not realize: Through his 20s he worked professionally as a commercial artist; it was not until he was 34 years old when he debuted his 32 Campbell’s Soup Can screenprint works at Ferus Gallery in Los Angeles in 1962! So it won’t come as any surprise, then, to see that Warhol’s most highly-valued works were made when he was 35-years old!

Price vs artist age

To make this point more concretely, look at how work made at various points in Warhol’s life compares to work made at the same periods in other artists’ lives.

For Warhol, the work made when he was 30–35 years old fetches the highest prices. For Mondrian, it’s his late work–stuff he made after 45. And for Braque, his early work is most highly valued.

Price vs artist age, comparison

9. Warhol’s Work Accrues Value with Age

Here’s another intuitive one: As Warhol’s work gets older, it accrues value.

Price vs artwork age


  • Target variable should be not price_realized_USD but its log.
  • For dimensional variables (width_cm, height_cm, and area_cm_sq), the model should be trained on their log (although there doesn’t appear to be much correlation here).
  • With all the other features we examined, there do seem to be very real and noticible correlations with price, which makes me think we’re on the right track for a functional model specific to Warhol.


Python Pandas Matplotlib Seaborn NumPy