SPARQL queries of the Billboard Hot 100

Current and historical data!

[Hot 100 and SPARQL logos]

Wikipedia describes the Billboard Hot 100 as “the music industry standard record chart in the United States for songs, published weekly by Billboard magazine. Chart rankings are based on sales (physical and digital), online streaming, and radio airplay in the U.S.” A song that ranks highly there is a hit song (in the U.S.) by definition. The data goes back to the beginning of the chart’s history in 1958, when Rick Nelson’s Poor Little Fool was the number one song.

I recently learned about billboard-hot-100, which is “a git repository of JSON files for every Billboard Hot 100 chart in history, updated daily”. Of course I thought it would be fun to query it with SPARQL, so I wrote something to convert the data to RDF. I did it as a github fork of the project that I called billboard-hot-100-rdf in case anyone else wants to play with it. One nice advantage of doing it this way is that, because Billboard updates their chart on Tuesday mornings and the billboard-hot-100 repository scrapes that data every Wednesday, you can do SPARQL queries against the latest data on Thursday through Monday.

Here is a sample of the JSON data from that billboard-hot-100 project. Instead of listing all the hits from that week, my excerpt only lists two:

[{
  "date": "2024-04-06",
  "data": [
    {
      "song": "Cruel Summer",
      "artist": "Taylor Swift",
      "this_week": 16,
      "last_week": 10,
      "peak_position": 1,
      "weeks_on_chart": 47
    },
    {
      "song": "Redrum",
      "artist": "21 Savage",
      "this_week": 30,
      "last_week": 22,
      "peak_position": 5,
      "weeks_on_chart": 11
    }  ]
}]

The h100json2rdf.py Python script in my fork of the project (which is only about 60 lines including white space and comments) converts the data above to this:

@prefix h1: <http://rdfdata.org/hot100#> .
@prefix schema: <http://schema.org/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

h1:TaylorSwift a h1:MusicalArtist ; 
   rdfs:label "Taylor Swift"@en .

h1:TaylorSwiftCruelSummer a schema:MusicRecording;
     schema:byArtist h1:TaylorSwift;
     dc:title "Cruel Summer";
     h1:charted "2024-04-06"^^xsd:date {| 
        h1:position 16
|}.

h1:21Savage a h1:MusicalArtist ; 
   rdfs:label "21 Savage"@en .

h1:21SavageRedrum a schema:MusicRecording;
     schema:byArtist h1:21Savage;
     dc:title "Redrum";
     h1:charted "2024-04-06"^^xsd:date {| 
        h1:position 30
|}.

A few things to note about the conversion:

  • As always, I used classes and prefixes from existing schemas when I could and made up new ones where necessary.

  • When the input data said that a given artist had a given hit, I created instances for both the artist and the song. As we’ll see below, this makes queries about artists like “who had hits in the most decades” easier.

  • URIs for the artists are made from their names with a little cleanup to make them proper URIs. URIs for the songs are similar but with the artist name and song title combined because sometimes completely different songs happen to have the same title and I wanted to distinguish them from each other. For example, as the charts tell us, Banarama had a hit with a song called “Cruel Summer” 30 years ago that is completely different from the Taylor Swift song of the same name.

  • To record that a given song charted on a given date at a particular position, I made the position value (h1:position above) a property of the triple about the song charting at that position. To put it another way, the position value is a property of the graph edge about the date.

  • I didn’t convert all the JSON properties from the original because, as we’ll see, values like weeks_on_chart are easy enough to query for with the data that I did convert.

I used my Python script to convert the project’s all.json data file into Turtle RDF, loaded the Turtle file into the free version of Ontotext GraphDB, and was ready to start querying. (Other files in that repository hold data for individual weeks, which can make some queries go much faster.)

Querying for the data that we didn’t convert

Our first few queries will show why the Python script didn’t bring all the numbers into the Turtle data. Let’s query for the number of weeks that a recording had been on the chart:

PREFIX h1: <http://rdfdata.org/hot100#>
PREFIX schema: <http://schema.org/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT (COUNT(?chartPosition) AS ?weeksOnChart) WHERE {
?recording a schema:MusicRecording ; 
               dc:title "Cruel Summer" ; 
               schema:byArtist/rdfs:label "Bananarama"@en .
    
   << ?recording h1:charted ?chartDate >> h1:position ?chartPosition .
}

The result of this query is that Bananarama’s “Cruel Summer” was on the charts for 18 weeks. Comparing the November 17th and November 24th charts from 1984 confirms that November 17th was the 18th and last chart appearance of Bananarama’s “Cruel Summer”.

Change the SELECT line in that query to the following and you’ll learn that recording’s highest U.S. chart position:

SELECT (MIN(?chartPosition) AS ?highestPosition) WHERE {

The query result is 9, which is confirmed as the recording’s highest chart position on its Wikipedia page.

How do we query for a recording’s position on the chart the week before a given appearance? We’ll ask about Dua Lipa’s “Houdini”. First the following query has a nested query that finds the most recent date that the recording was on the chart, and then the outer query asks about its position the week before that:

PREFIX h1: <http://rdfdata.org/hot100#>
PREFIX schema: <http://schema.org/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?dateLastWeek ?positionLastWeek WHERE {
   # 2. Find out the week before the latest chart 
   # appearance and the position from that week. 
   BIND (?latestPosition - "P7D"^^xsd:duration AS ?dateLastWeek)
   << ?recording h1:charted ?dateLastWeek >> h1:position ?positionLastWeek . 
   {
     # 1. Find the date of the latest chart appearance. 
     SELECT ?recording (MAX(?chartDate) AS ?latestPosition) WHERE {
       ?recording a schema:MusicRecording ; 
                  dc:title "Houdini" ; 
                  schema:byArtist/rdfs:label "Dua Lipa"@en .
       << ?recording h1:charted ?chartDate >> h1:position ?chartPosition .
      }
   GROUP BY ?recording
   } 
} 

The result tells me that last week (as I write this), on the March 30th chart, this song was at position 29, which is confirmed on the Billboard web site.

Querying for new things

Besides querying for data that the Python script didn’t bring over to the Turtle data, what else can we query for? For example, which artist had hits in the most decades? For the decade, this next query takes the first three digits of the chart date and then groups and sorts by that. Then, it lists everyone who has had hits in at least five decades:

PREFIX schema: <http://schema.org/>
PREFIX h1: <http://rdfdata.org/hot100#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 

SELECT (COUNT(DISTINCT ?decade) AS ?decades) ?artistName WHERE { 
?recording a schema:MusicRecording ; 
             schema:byArtist/rdfs:label ?artistName ;
             h1:charted ?chartDate . 
   BIND (SUBSTR(str(?chartDate),1,3) AS ?decade)
}
GROUP BY ?artistName
HAVING (?decades > 4)
ORDER BY DESC(?decades)

The results:

?decades,?artistName
7,"Elvis Presley"@en
6,"Frank Sinatra"@en
6,"Cher"@en
6,"Paul McCartney"@en
6,"Michael Jackson"@en
5,"Chuck Berry"@en
5,"Andy Williams"@en
5,"The Isley Brothers"@en
5,"Brenda Lee"@en
5,"The Beatles"@en
5,"The Rolling Stones"@en
5,"Stevie Wonder"@en
5,"Fleetwood Mac"@en
5,"Eagles"@en
5,"Prince"@en

In what decades did Little Richard have hits?

PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
PREFIX h1:     <http://rdfdata.org/hot100#>
PREFIX schema: <http://schema.org/>

SELECT DISTINCT ?decade WHERE {
  ?artist rdfs:label "Little Richard"@en . 
  ?s schema:byArtist ?artist ;
     h1:charted ?chartDate .
  BIND (CONCAT(SUBSTR(str(?chartDate),1,3),'*') AS ?decade)
}

(The three-digit year values looked odd so I added asterisks to make them look more like years with wildcards.)

?decade
"195*"
"196*"
"197*"
"198*"

To be honest, everything I’ve done so far could be done with a relational database. I’ve been experimenting with ways to make this a real knowledge graph by adding additional data from Wikidata. I have some ideas for more interesting queries to make about the artists and their relationships to their hits.


Comments? Reply to my tweet (or even better, my Mastodon message) announcing this blog entry.