I’m working on a machine learning project that will model auction prices for paintings. Based on some domain knowledge, I have a solid intuition that certain keywords in the artwork title and medium features will be useful in predicting price outcomes. My eventual plan is to use sklearn.feature_extraction.text.TfidfVectorizer to create features from the words in each of these fields, but before doing that I thought it would be helpful to get a better sense of what the most frequently occurring words actually are.

Hence, a wordcloud. Turns out it’s super easy to generate these using the wordcloud module.

Andy Warhol

Let’s take the titles of Andy Warhol works as an example. As a reminder, we are not considering all Warhol works here, but only those works that appear in the auction data I collected, so any trends we notice should not be imputed to Warhol the artist.

Here’s a sample of the data:

| | artist_name | title | date | medium | dims | auction_date | auction_house | auction_sale | auction_lot | price_realized | ... | auction_year | price_realized_USD_constant_2022 | area_cm_sq | volume_cm_cu | living | years_after_death_of_auction | artist_age_at_auction | artist_age_at_artwork_completion | artwork_age_at_auction | years_ago_of_auction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013 | Andy Warhol | The Shadow | 1981 | unique screenprint on Lenox Museum Board | 96.5 x 96.5 cm | Dec 14, 2022 | Christie's | First Open \| Post-War & Contemporary Art | Lot19 | US\$52,920 | ... | 2022 | 52920.0 | 9312.25 | NaN | 0 | 35.0 | NaN | 53.0 | 41.0 | 1 |
| 2014 | Andy Warhol | Nervous System | 1985 | synthetic polymer on canvas | 50.8 x 58.4 cm | Dec 14, 2022 | Christie's | First Open \| Post-War & Contemporary Art | Lot39 | US\$40,320 | ... | 2022 | 40320.0 | 2966.72 | NaN | 0 | 35.0 | NaN | 57.0 | 37.0 | 1 |
| 2015 | Andy Warhol | Portrait of Anselmino | 1974 | Acrylic and silkscreen on canvas | 101.5 x 101.5 cm | Dec 9, 2022 | Ketterer Kunst | Evening Sale with Collection Hermann Gerlinger | Lot60 | €375,000• US\$395,839 | ... | 2022 | 395839.0 | 10302.25 | NaN | 0 | 35.0 | NaN | 46.0 | 48.0 | 1 |
| 2016 | Andy Warhol | Vanishing Animals: Okapi | 1986 | synthetic polymer paint on paper | 59.2 x 80 cm (23 1/4 x 31 1/2 in.) | Dec 8, 2022 | Phillips• London | New Now | Lot100 | £15,120• US\$18,498 | ... | 2022 | 18498.0 | 4736.00 | NaN | 0 | 35.0 | NaN | 58.0 | 36.0 | 1 |
| 2017 | Andy Warhol | Tie | 1979 | acrylic on cut canvas | 5.0 x 137.2 cm | Dec 7, 2022 | Sotheby's | Contemporary Discoveries | Lot159 | NaN | ... | 2022 | NaN | 686.00 | NaN | 0 | 35.0 | NaN | 51.0 | 43.0 | 1 |

5 rows × 45 columns

At this point, we’re ready to create a wordcloud. The only slightly tricky bit is that the WordCloud.generate method takes a single string, but the title feature of this DataFrame is a Series, so I’ve dropped the missing values and used the str.join method to join the remaining titles into one very long string.

# Import plotting and wordcloud modules
import matplotlib.pyplot as plt
from wordcloud import WordCloud

plt.subplots(figsize=(10, 8))

# Build the wordcloud from all non-null titles joined into one string
# (`warhol` is the DataFrame of Warhol auction records sampled above)
wordcloud = (
    WordCloud(max_words=100, background_color='white', colormap='magma', width=800, height=800)
    .generate(' '.join(warhol['title'].dropna()))
)

# Show image
plt.imshow(wordcloud, interpolation='bilinear')

Warhol Titles Wordcloud

This is helpful for getting a sense of how frequently various words appear in the dataset (and pairs of words, too, since WordCloud includes common two-word phrases by default).
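A wordcloud is nice for eyeballing, but it can also help to see raw counts. Here’s a minimal sketch using only the standard library. The top_words helper, its tiny stopword list, and the toy titles are all made up for illustration; WordCloud does its own tokenizing and stopword filtering, so its numbers won’t match these exactly.

```python
from collections import Counter
import re

# Hypothetical mini stopword list, just for this sketch
BASIC_STOPWORDS = frozenset({"the", "of", "a", "an", "and", "in"})

def top_words(titles, n=10, stopwords=BASIC_STOPWORDS):
    """Return the n most common non-stopword tokens across the titles."""
    counts = Counter()
    for title in titles:
        # Skip missing values (None, NaN), mirroring .dropna()
        if not isinstance(title, str):
            continue
        # Lowercase, then keep runs of letters and apostrophes as tokens
        words = re.findall(r"[a-z']+", title.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts.most_common(n)

# Toy titles invented for illustration (not the auction dataset)
toy_titles = ["Flowers", "Flowers (Red)", "Jackie", "The Shadow", "Shadow II", None]

print(top_words(toy_titles, n=3))
```

On the real data, something like top_words(warhol['title']) should work the same way, since iterating over a Series yields its values.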

Out of curiosity, let’s see if there are noticeable auction price differences (in constant 2022 dollars) for some of these keywords.

Warhol Title Keywords Price Correlation

Looks like Jackie O works fetch a higher median price than, for instance, Campbell’s Soup works.
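For anyone who wants to run a similar comparison, here’s a rough sketch of one way to do it in pandas. The helper name median_price_by_keyword and the toy rows are hypothetical and invented for illustration; only the price column name comes from the sample above. Treat it as an illustration, not the exact code behind the chart.

```python
import pandas as pd

def median_price_by_keyword(df, text_col, price_col, keywords):
    """Median of price_col over the rows whose text_col contains each keyword."""
    medians = {}
    for kw in keywords:
        # Case-insensitive literal match; rows with missing text never match
        mask = df[text_col].str.contains(kw, case=False, na=False, regex=False)
        # .median() skips NaN prices by default
        medians[kw] = df.loc[mask, price_col].median()
    return pd.Series(medians)

# Toy rows invented for illustration (not real auction results)
toy = pd.DataFrame({
    "title": ["Jackie", "Jackie II", "Campbell's Soup I", "Campbell's Soup II", None],
    "price_realized_USD_constant_2022": [300.0, 500.0, 100.0, None, 50.0],
})

result = median_price_by_keyword(
    toy, "title", "price_realized_USD_constant_2022", ["jackie", "soup"]
)
print(result)
```

The same helper works on any text column, so passing "medium" instead of "title" would give the equivalent comparison for media.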

I’m also curious to generate a wordcloud for the medium feature.

plt.subplots(figsize=(10, 8))

# Same approach as before, this time joining the non-null medium values
wordcloud = (
    WordCloud(max_words=100, background_color='white', colormap='magma', width=800, height=800)
    .generate(' '.join(warhol['medium'].dropna()))
)

plt.imshow(wordcloud, interpolation='bilinear')

Warhol Medium Wordcloud

For fun, let’s look at how realized price varies based on media.

Warhol Medium Keyword Price Correlation

That’s a striking difference, and it suggests that doing some natural language processing on the medium feature could be genuinely useful for predicting price.