How to Quickly Generate Wordclouds in Python
I’m working on a machine learning project that will model auction prices for paintings. Based on some domain knowledge, I have a solid intuition that certain keywords included in the artwork title and medium features will be useful in predicting price outcomes. My eventual plan is to use sklearn.feature_extraction.text.TfidfVectorizer to create features from the words in each of these fields, but before doing that I thought it would be helpful to get a better sense of what the most frequently occurring words even were.
Hence, a wordcloud. Turns out it’s super easy to generate these using the wordcloud module.
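The module isn’t part of the standard library; if you don’t already have it, it’s installable from PyPI:

pip install wordcloud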
Andy Warhol
Let’s take the titles of Andy Warhol works as an example. As a reminder, we are not considering all Warhol works here, but only those works that appear in the auction data I collected, so any trends we notice should not be imputed to Warhol the artist.
Here’s a sample of the data:
warhol.head()
| | artist_name | title | date | medium | dims | auction_date | auction_house | auction_sale | auction_lot | price_realized | ... | auction_year | price_realized_USD_constant_2022 | area_cm_sq | volume_cm_cu | living | years_after_death_of_auction | artist_age_at_auction | artist_age_at_artwork_completion | artwork_age_at_auction | years_ago_of_auction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2013 | Andy Warhol | The Shadow | 1981 | unique screenprint on Lenox Museum Board | 96.5 x 96.5 cm | Dec 14, 2022 | Christie's | First Open \| Post-War & Contemporary Art | Lot19 | US$52,920 | ... | 2022 | 52920.0 | 9312.25 | NaN | 0 | 35.0 | NaN | 53.0 | 41.0 | 1 |
| 2014 | Andy Warhol | Nervous System | 1985 | synthetic polymer on canvas | 50.8 x 58.4 cm | Dec 14, 2022 | Christie's | First Open \| Post-War & Contemporary Art | Lot39 | US$40,320 | ... | 2022 | 40320.0 | 2966.72 | NaN | 0 | 35.0 | NaN | 57.0 | 37.0 | 1 |
| 2015 | Andy Warhol | Portrait of Anselmino | 1974 | Acrylic and silkscreen on canvas | 101.5 x 101.5 cm | Dec 9, 2022 | Ketterer Kunst | Evening Sale with Collection Hermann Gerlinger | Lot60 | €375,000 • US$395,839 | ... | 2022 | 395839.0 | 10302.25 | NaN | 0 | 35.0 | NaN | 46.0 | 48.0 | 1 |
| 2016 | Andy Warhol | Vanishing Animals: Okapi | 1986 | synthetic polymer paint on paper | 59.2 x 80 cm (23 1/4 x 31 1/2 in.) | Dec 8, 2022 | Phillips • London | New Now | Lot100 | £15,120 • US$18,498 | ... | 2022 | 18498.0 | 4736.00 | NaN | 0 | 35.0 | NaN | 58.0 | 36.0 | 1 |
| 2017 | Andy Warhol | Tie | 1979 | acrylic on cut canvas | 5.0 x 137.2 cm | Dec 7, 2022 | Sotheby's | Contemporary Discoveries | Lot159 | NaN | ... | 2022 | NaN | 686.00 | NaN | 0 | 35.0 | NaN | 51.0 | 43.0 | 1 |

5 rows × 45 columns
At this point, we’re ready to create a wordcloud. The only slightly tricky bit is that the WordCloud.generate method takes a string, but the title feature of this DataFrame is a Series, so I’ve used the str.join method to join the individual titles into one very long string.
# Import the wordcloud module (plus matplotlib for display)
from wordcloud import WordCloud
import matplotlib.pyplot as plt

plt.subplots(figsize=(10, 8))

# Instantiate a WordCloud object and generate the cloud from the joined titles
wordcloud = (
    WordCloud(max_words=100, background_color='white', colormap='magma', width=800, height=800)
    .generate(' '.join(warhol['title'].dropna()))
)

# Show the image
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off');
This is helpful to get a sense of how frequently various words (and pairs of words, too) appear in the dataset.
Out of curiosity, let’s see if there are noticeable auction price differences (in constant 2022 dollars) for some of these keywords.
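Here’s a rough sketch of one way to make that comparison, flagging titles that contain each keyword and plotting the price distributions side by side. The keyword list is just illustrative (I’ve picked a few that show up in the cloud):

# Rough sketch: compare realized prices for works whose titles contain a keyword.
# The keyword list is illustrative, not exhaustive.
import matplotlib.pyplot as plt

keywords = ['Jackie', 'Campbell', 'Flowers', 'Marilyn']

prices = [
    warhol.loc[
        warhol['title'].str.contains(kw, case=False, na=False),
        'price_realized_USD_constant_2022',
    ].dropna()
    for kw in keywords
]

fig, ax = plt.subplots(figsize=(10, 6))
ax.boxplot(prices, labels=keywords, showfliers=False)
ax.set_ylabel('Price realized (constant 2022 USD)')
ax.set_title('Auction prices by title keyword');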
Looks like Jackie O works fetch a higher median price than, for instance, Campbell’s Soup works.
I’m also curious to generate a wordcloud for the medium feature.
plt.subplots(figsize=(10, 8))

# Same recipe as before, now on the medium descriptions
wordcloud = (
    WordCloud(max_words=100, background_color='white', colormap='magma', width=800, height=800)
    .generate(' '.join(warhol['medium'].dropna()))
)

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off');
For fun, let’s look at how realized price varies by medium.
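A quick way to eyeball this is to compare median realized prices for a few medium keywords that show up in the cloud. The grouping below is just illustrative:

# Compare median constant-2022-dollar prices for a few medium keywords
# (the keyword grouping here is illustrative)
for kw in ['canvas', 'paper', 'screenprint']:
    mask = warhol['medium'].str.contains(kw, case=False, na=False)
    median = warhol.loc[mask, 'price_realized_USD_constant_2022'].median()
    print(f"{kw}: n={mask.sum()}, median=${median:,.0f}")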
That’s a striking difference and suggests that doing some Natural Language Processing on the medium feature could really be beneficial.
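As a preview of that eventual plan, here’s roughly what the TfidfVectorizer step might look like. The parameters are placeholders rather than tuned choices:

# Sketch of the planned TF-IDF step (max_features and stop_words are placeholders)
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit TF-IDF on the medium descriptions, filling missing values with empty strings
vectorizer = TfidfVectorizer(max_features=100, stop_words='english')
tfidf = vectorizer.fit_transform(warhol['medium'].fillna(''))

# One column per term, indexed so it can be joined back onto the modeling DataFrame
tfidf_features = pd.DataFrame(
    tfidf.toarray(),
    columns=vectorizer.get_feature_names_out(),
    index=warhol.index,
)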