data engineering (DB에 table 만들기)

2019-12-17

Spotify가 국내에 음원 진출을 확정지었다는 기사를 보면서 다시 한번 이 토이프로젝트에 대해 동기부여가 되었다. Spotify API를 통해 AWS에 만들어 놓은 DB에 입력해 볼 것이다.

Spotify

Spotify API를 이용해서 DB 구축하기

먼저, 혹시라도 필자의 토이 프로젝트의 목표가 무엇인지 모르실 분들을 위해 말하자면, Spotify data를 통해서 자신의 취향을 입력하면 그 취향과 비슷한 장르의 음악과 뮤지션을 추천해주는 chat bot을 Facebook API를 통해서 만들고자 한다.

toy project DB관련 도식

위와 같은 서비스를 만들기 위해서는 먼저 필요한 데이터를 얻기 위해 Spotify API의 response로 얻을 수 있는 값들이 무엇이 있고, 어떤 method를 통해 어떤 데이터를 얻을 수 있는지 정리를 해 보고 간단한 ERD를 통해 도식화 할 수 있을 것이다.

Spotify developer를 클릭해서 아래 그림과 같이 DOCS 탭에서 WEB API reference를 클릭하면 여러 항목들이 나올 것이다.

Spotify Web reference

여러 탭들 중 Artist와 관련된 사항들과 track에 따른 Audio feature들을 필요로 하기 때문에 Artists 탭과 Tracks 탭을 살펴보면 될 것이다. 확인해 본 결과 필요한 부분들을 정해 아래 그림과 같이 ERD를 도식화 할 수 있다.

Spotify ERD

Get an Artist 탭에서 얻을 수 있는 테이블은 다음과 같다. 특히 genre는 가져올때는 python의 list형식으로 되어 있기 때문에 우리가 사용할 RDB에 저장할 때는 list로 저장할 수 없으므로 또 다른 테이블을 만들어 주어 하나의 id에 여러가지 genre 값을 갖을 수 있게끔 사용할 것이다.
- artist라는 테이블은 ‘id’, ‘Name’, ‘Followers’, ‘Popularity’, ‘External_Urls’(해당 가수 페이지의 url)을 넣어 줄 것이다. 그 중 ‘id’는 Not NuLL과 Unique한 것을 동시에 만족시키는 Primary Key로 지정할 것이다. 왜냐하면 고유한 식별자이며, 이를 통해 다른 테이블들과의 join하는 것이나 검색에 있어서 훨씬 빠른 검색을 할 수 있게 되기 때문이다.
- artist genres 테이블은 ‘Artist_Id’, ‘Genre’ 값을 갖게 할 것이며, 여기서는 Artist_Id가 Foreign Key 역할을 할 것이다.
Get an Artist's Top Tracks에서는 top_tracks 테이블을 얻을 수 있다. ‘id’(track id), ‘Artist ID’, ‘Name’, ‘Popularity’(spotify에서 계산한 특정 알고리즘에 의해 산출된 0~100사이의 score), ‘URL’, ‘image_url’을 넣어 줄 것이다.
마지막으로 유일하게 Tracks탭에서 Get Audio Features for Several Tracks를 통해 audio feature인 ‘track_id’, ‘Key’, ‘Mode’, ‘Acousticness’, ‘Danceability’, ‘Energy’, ‘Instrumentalness’, ‘Liveness’, ‘Loudness’, ‘Speechiness’, ‘Valence’, ‘Tempo’를 넣어 줄 것이다.
- ‘Key’ : 음의 높낮이를 의미하며, 각각에 해당하는 음의 높낮이를 정수에 mapping했다. E.g. 0 = C, 1 = C♯/D♭, 2 = D등
- ‘Mode’ : 장조(도레미파솔라시)로 이루어져있다면 1, 단조(라시도레미파솔)로 이루어져 있다면 0을 갖는다.
- ‘Acousticness’ : Acoustic적인 트랙인지를 말해주는 지표로서, 0~1값을 갖고, 해당 트랙이 Acoustic적인 음악일수록 1값을 갖는다.
- ‘Danceability’ : 음악적 요소인 템포, 리듬의 안정성등의 조합을 기반으로하여 춤에 적합한지를 나타내는 지표이다. 0~1사이의 값을 갖으며, 춤에 적합한 트랙일수록 1의 값을 갖는다.
- ‘Energy’ : 활동적이고 긴장을 줄 수 있는 트랙인지에 대한 지표로서 0~1사이의 값을 갖는다.
- ‘Instrumentalness’ : 트랙이 목소리가 주인지를 파악하는 지표로서, 여기서 말하는 목소리란, 추임새적인 부분들을 악기적인 요소로 보고 이런 악기적인 요소가 많을수록 1에 가깝운 값을 갖는다.
- ‘Liveness’ : 해당 트랙이 live음원인지를 파악하는 요소로서, 0.8을 넘는다면 live 트랙일 확률이 높다.
- ‘Loudness’ : 트랙 전반적인 평균 데시벨(dB)로서, 상대적으로 다른 트랙들과 비교할 수 있다. 일반적인 값은 -60에서 0 db 사이이다.
- ‘Speechiness’ : 토크쇼, 오디오북, 시같이 구어체 단어들의 존재를 탐지하는 지표로서 0.66 값 이상은 아마도 완전히 구어체로 만들어진 트랙을 의미한다고 하며, 1에 가까울수록 구어체를 많이 가지고 있다.
- ‘Valence’ : 트랙의 감성을 나타내는 지표로서, 0~1사이의 값을 갖는다. 1에 가까울수록 밝고 긍정적인 느낌의 트랙을 의미한다.
- ‘Tempo’ : 전반적인 트랙의 BPM을 추정하는 지표이다.

Get Audio Features for Several Tracks

DB Table 생성

Python Script 파일을 작성하기에 앞서, 먼저 SQL에 접속한 후 데이터를 insert할 경우 어떤 구문을 사용하여야 적합할지에 대해 알아 볼 것이다.

## artists에 관한 table
# 이모지까지 커버 하고 싶은 경우
mysql> create table artists (id VARCHAR(255), name VARCHAR(255), followers INTEGER, popularity INTEGER, url VARCHAR(255), image_url VARCHAR(255), PRIMARY KEY(id)) ENGINE=InnoDB DEFAULT CHARSET='utf8mb4' COLLATE 'utfmb4_unicode_ci'

# 이모지는 제외하고 문자만 커버하는 경우
mysql> create table artists (id VARCHAR(255), name VARCHAR(255), followers INTEGER, popularity INTEGER, url VARCHAR(255), image_url VARCHAR(255), PRIMARY KEY(id)) ENGINE=InnoDB DEFAULT CHARSET='utf8'

## artist_genres에 관한 테이블
mysql> create table artist_genres (artist_id VARCHAR(255), genre VARCHAR(255)) ENGINE=InnoDB DEFAULT CHARSET='utf8';

# 위에서 만든 artists table에 대한 info를 보고 싶은 경우
mysql> show create table artists;

앞으로 artist_genres 테이블에 어떤 artist의 장르가 추가된다면, 데이터를 추가해주어야 하는데 지속적으로 추가해주어야하므로 자동화를 할 것이다. 추가해 주는 값이 테이블에 이미 있는 값을 갖는 데이터가 들어온다면 무의미할 것이다. 아무런 column에 제약을 주지 않았기에 insert를 통한 데이터 추가 방식은 무의미하다.

# 데이터 추가
mysql> insert into artist_genres (artist_id, genre) values ('1234', 'pop');

# 테이블 확인
mysql> select * from artist_genre;

# 데이터 추가
mysql> insert into artist_genres (artist_id, genre) values ('1234', 'pop');

# 테이블 확인
mysql> select * from artist_genre;

# 테이블 값만 삭제
mysql> delete from artist_genres;

# 테이블 자체를 삭제
# 실무에서는 drop은 잘 사용하지 않고, Alter를 사용한다.
mysql> drop table artist_genres;

앞에서 말한 추후에 데이터 입력시 동일한 데이터를 계속 추가하는 문제를 방지하기 위해 column에 unique key 속성을 추가해 주었다.

이전의 방법과 동일하게 추가했을 경우 insert into구문은 오류를 발생하여서 추가 데이터를 저장하진 않지만, Python script가 도중에 멈추게 되는 문제가 생긴다는 점을 확인 할 수 있다.

# Unique key 속성 부여
mysql> create table artist_genres (artist_id VARCHAR(255), genre VARCHAR(255), unique key(artist_id, genre)) ENGINE=InnoDB DEFAULT CHARSET='utf8';

# 속성 추가 확인
mysql> show create table artist_genres;

# 데이터 추가
mysql> insert into artist_genres (artist_id, genre) values ('1234', 'pop');

# 데이터 확인
mysql> select * from artist_genres;

# 데이터 추가
mysql> insert into artist_genres (artist_id, genre) values ('1234', 'pop');

ERROR 1062 (23000): Duplicate entry '1234-pop' for key 'artist_id'

insert into 구문 대신 update set 구문을 사용해보았다. 우선 키값이 있기 때문에 중복되는 값이 저장되지는 않는다. update set 구문은 NULL데이터를 가지고 있었다면 변경되지 않는다는 문제점이 있다.
replace into 구문을 통해 값이 변경되기 했지만 성능적인 측면에서, 데이터가 많을땐 먼저 키값이 있는지 찾고, 키 값이 존재한다면 지금처럼 그 행을 지우고 새로운 행으로 바꿔준다.
또한, primary key와 auto increment를 통해 설정되어있던 테이블이라면 원래 primary key로 인해 부여 받은 값이 아닌 새로운 값을 부여받게된다는 문제점이 있다. 예를 들면, auto increment로 DB에 입력되어 지는대로 번호를 부여했는데, DB에 입력된지 오래된 id에 genre를 추가하려고 한다면 기존의 행번호를 지우고 table의 맨뒤에 새로 입력된다는 것이다.
on duplicate key update를 통해 문제점을 해결 할 수 있다!

# 데이터 추가
mysql> update artist_genres set genre='pop' where artist_id='1234';

# columns 추가
mysql> alter table artist_genres add column country VARCHAR(255);

# update 되는 시점을 갖는 column 추가
# 자동적으로 데이터가 추가 될때마다 그 시점이 저장됨
mysql> alter table artist_genres add column updated_at timestamp default current_timestamp on update current_timestamp;

# 동일한 키값을 갖는 데이터가 이미 존재하므로 오류를 발생시킴.
mysql> INSERT INTO artist_genres (artist_id, genre, country) VALUES ('1234', 'pop', 'UK');

# replace into 구문을 통해 값이 변경되기 했지만 성능적인 측면에서, 데이터가 많을땐 먼저 키값이 있는지 찾고, 존재한다면 지금처럼 그 행을 지우고 새로운 행으로 바꿔준다.
mysql> REPLACE INTO artist_genres (artist_id, genre, country) VALUES ('1234', 'pop', 'UK');
# 그러므로 위의 조건에 부합하는 행위인 키값인 artist_id와 genre값이 있는 행을 찾기 때문에 2rows가된다.
Query OK, 2 rows affected (0.28 sec)

# 위에서 replace into 구문을 통해 값이 변경되었음을 확인 할 수 있다.
mysql> select * from artist_genres;

# artist_id, genre가 Unique key인데 두 컬럼을 동시에 동일한 값을 갖는 row가 없으므로 새로 추가 해준다.
mysql> REPLACE INTO artist_genres (artist_id, genre, country) VALUES ('1234', 'rock', 'UK');

mysql> select * from artist_genres;

# 말 그대로 동일한 키값이 있으면 그냥 insert into 구문은 오류를 발생시켰지만, insert ignore into 구문은 오류를 발생시키지 않고 무시한다.
# unique한 key값인 artist_id와 genre가 동일한 행이 이미 테이블에 존재하기 때문이다.
mysql> insert ignore into artist_genres (artist_id, genre, country) VALUES ('1234', 'rock', 'FR');

mysql> select * from artist_genres;

# on duplicate key update
mysql> insert into artist_genres (artist_id, genre, country) values ('1234', 'rock', 'FR') on duplicate key update artist_id='1234', genre='rock', country='FR'


# 우리가 입의로 넣어 주었던 country는 지워줄 것이다.
mysql> alter table artist_genres drop column country;

결론은 다음 구문을 query문으로 사용하겠다는 것이다.

1	mysql> insert into artist_genres (artist_id, genre, country) values ('1234', 'rock', 'FR') on duplicate key update artist_id='1234', genre='rock', country='FR'

create_artist_table.py

import sys
import requests
import base64
import json
import logging
import pymysql
import csv
import sys, os, argparse


def main(host, user, passwd, db, port, client_id, client_secret):

    try:
        conn = pymysql.connect(host=host, user=username, passwd=password, db=database, port=port, use_unicode=True, charset='utf8')
        cursor = conn.cursor()
    except:
        logging.error("could not connect to rds")
        sys.exit(1)

    headers = get_headers(client_id, client_secret)

    ## Spotify Search API

    artists = []
    with open('artist_list.csv') as f:
        raw = csv.reader(f)
        for row in raw:
            artists.append(row[0])

    for a in artists:
        params = {
            "q": a,
            "type": "artist",
            "limit": "1"
        }

        r = requests.get("https://api.spotify.com/v1/search", params=params, headers=headers)

        raw = json.loads(r.text)

        artist = {}
        try:
            artist_raw = raw['artists']['items'][0]
            if artist_raw['name'] == params['q']:

                artist.update(
                    {
                        'id': artist_raw['id'],
                        'name': artist_raw['name'],
                        'followers': artist_raw['followers']['total'],
                        'popularity': artist_raw['popularity'],
                        'url': artist_raw['external_urls']['spotify'],
                        'image_url': artist_raw['images'][0]['url']
                    }
                )
                insert_row(cursor, artist, 'artists')
        except:
            logging.error('something worng')
            continue

    conn.commit()
    sys.exit(0)


    try:
        r = requests.get("https://api.spotify.com/v1/search", params=params, headers=headers)

    except:
        logging.error(r.text)
        sys.exit(1)


    r = requests.get("https://api.spotify.com/v1/search", params=params, headers=headers)

    if r.status_code != 200:
        logging.error(r.text)

        if r.status_code == 429:

            retry_after = json.loads(r.headers)['Retry-After']
            time.sleep(int(retry_after))

            r = requests.get("https://api.spotify.com/v1/search", params=params, headers=headers)

        ## access_token expired
        elif r.status_code == 401:

            headers = get_headers(client_id, client_secret)
            r = requests.get("https://api.spotify.com/v1/search", params=params, headers=headers)

        else:
            sys.exit(1)

def get_headers(client_id, client_secret):

    endpoint = "https://accounts.spotify.com/api/token"
    # base64 encode 및 decode 함수는 모두 byte 형 객체를 필요로 한다.
    # 문자열을 byte로 가져 오려면, 파이썬의 내장 된 encode 함수를 사용하여 문자열을 encoding해주어야 한다.
    encoded = base64.b64encode("{}:{}".format(client_id, client_secret).encode('utf-8')).decode('ascii')

    headers = {
        "Authorization": "Basic {}".format(encoded)
    }

    payload = {
        "grant_type": "client_credentials"
    }

    r = requests.post(endpoint, data=payload, headers=headers)
    # r.text가 type이 string이므로 json형태로 만들어 주기 위해 json 패키지를 사용한다.
    access_token = json.loads(r.text)['access_token']

    # Spotif Web API를 접근 가능하게 하는 Access Token은 다음과 같은 형식으로 사용가능하고 Authrization 페이지에 나와있다.
    headers = {
        "Authorization": "Bearer {}".format(access_token)
    }

    return headers


def insert_row(cursor, data, table):

    placeholders = ', '.join(['%s'] * len(data)) # "%s, %s, %s, %s, %s, %s"
    columns = ', '.join(data.keys()) # "id, name, follwers, popularity, url, image_url"
    key_placeholders = ', '.join(['{0}=%s'.format(k) for k in data.keys()]) # "id=%s, name=%s, follwers=%s, popularity=%s, url=%s, image_url=%s"
    sql = "INSERT INTO %s ( %s ) VALUES ( %s ) ON DUPLICATE KEY UPDATE %s" % (table, columns, placeholders, key_placeholders)
    cursor.execute(sql, list(data.values())*2)



if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--client_id', type=str, help='Spotify app client id')
    parser.add_argument('--client_secret', type=str, help='Spotify client secret')
    parser.add_argument('--host', type=str, help='end point host')
    parser.add_argument('--username', type=str, help='AWS RDS id')
    parser.add_argument('--database', type=str, help='DB name')
    parser.add_argument('--password', type=str, help='AWS RDS password')
    args = parser.parse_args()
    port = 3306
    main(host=args.host, user=args.username, passwd=args.password, db=args.database, port=port, client_id=args.client_id, client_secret=args.client_secret)

결과 이미지

python script를 짜고 추후에 script를 실행하여 바로 RDS에 저장하고 table이 제대로 생성됬는지 확인하였다.

artist 테이블 생성

artist genre table

artist genre table은 최대로 50개를 검색할수 있는 Spotify search API 메서드를 통해 batch 단위로 검색할 수 있도록 Python script를 구성하였다. 먼저 artist table에 있는 id들을 모두 fetch_all한 후에 batch size인 50개씩 묶어서 list로 만들어 둔 후 join을 통해 한꺼번에 search 하는 방법을 사용하였다.
search API 사용법

import sys
import requests
import base64
import json
import logging
import pymysql
import csv
import sys, os, argparse

def main(host, user, passwd, db, port, client_id, client_secret):

    try:
        # use_unicode=True를 써야 한글같은 경우는 깨지지 않는다.
        conn = pymysql.connect(host=host, user=username, passwd=password, db=database, port=port, use_unicode=True, charset='utf8')
        cursor = conn.cursor()
    except:
        logging.error("could not connect to rds")
        # 보통 문제가 없으면 0
        # 문제가 있으면 1을 리턴하도록 안에 숫자를 넣어준다.
        sys.exit(1)

    headers = get_headers(client_id, client_secret)

    cursor.execute("SELECT id  FROM artists")
    artists = []
    # 커서의 fetchall() 메서드는 모든 데이타를 한꺼번에 클라이언트로 가져올 때 사용된다.
    # 또다른 fetch 메서드로서 fetchone()은 한번 호출에 하나의 Row 만을 가져올 때 사용된다.
    for (id, ) in cursor.fetchall():
        artists.append(id)

    artist_batch = [artists[i: i+50] for i in range(0, len(artists), 50)]

    artist_genres = []
    for i in artist_batch:

        ids = ','.join(i)
        URL = "https://api.spotify.com/v1/artists/?ids={}".format(ids)

        r = requests.get(URL, headers=headers)
        raw = json.loads(r.text)

        for artist in raw['artists']:
            for genre in artist['genres']:

                artist_genres.append(
                    {
                        'artist_id': artist['id'],
                        'genre': genre
                    }
                )

    for data in artist_genres:
        insert_row(cursor, data, 'artist_genres')

    # commit을 해줘야 records를 볼 수 있다.
    conn.commit()
    cursor.close()

    sys.exit(0)




    try:
        r = requests.get("https://api.spotify.com/v1/search", params=params, headers=headers)

    except:
        logging.error(r.text)
        sys.exit(1)


    r = requests.get("https://api.spotify.com/v1/search", params=params, headers=headers)

    if r.status_code != 200:
        logging.error(r.text)

        if r.status_code == 429:

            retry_after = json.loads(r.headers)['Retry-After']
            time.sleep(int(retry_after))

            r = requests.get("https://api.spotify.com/v1/search", params=params, headers=headers)

        ## access_token expired
        elif r.status_code == 401:

            headers = get_headers(client_id, client_secret)
            r = requests.get("https://api.spotify.com/v1/search", params=params, headers=headers)

        else:
            sys.exit(1)

def get_headers(client_id, client_secret):

    endpoint = "https://accounts.spotify.com/api/token"
    encoded = base64.b64encode("{}:{}".format(client_id, client_secret).encode('utf-8')).decode('ascii')

    headers = {
        "Authorization": "Basic {}".format(encoded)
    }

    payload = {
        "grant_type": "client_credentials"
    }

    r = requests.post(endpoint, data=payload, headers=headers)

    access_token = json.loads(r.text)['access_token']

    headers = {
        "Authorization": "Bearer {}".format(access_token)
    }

    return headers


def insert_row(cursor, data, table):
    """
    insert into query문 작성및 실행을 좀 더 간편하게 하기 위해 만든 함수
    """

    placeholders = ', '.join(['%s'] * len(data))
    columns = ', '.join(data.keys())
    key_placeholders = ', '.join(['{0}=%s'.format(k) for k in data.keys()])
    sql = "INSERT INTO %s ( %s ) VALUES ( %s ) ON DUPLICATE KEY UPDATE %s" % (table, columns, placeholders, key_placeholders)
    cursor.execute(sql, list(data.values())*2)




if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--client_id', type=str, help='Spotify app client id')
    parser.add_argument('--client_secret', type=str, help='Spotify client secret')
    parser.add_argument('--host', type=str, help='end point host')
    parser.add_argument('--username', type=str, help='AWS RDS id')
    parser.add_argument('--database', type=str, help='DB name')
    parser.add_argument('--password', type=str, help='AWS RDS password')
    args = parser.parse_args()
    port = 3306
    main(host=args.host, user=args.username, passwd=args.password, db=args.database, port=port, client_id=args.client_id, client_secret=args.client_secret)

결과 이미지

python script를 짜고 추후에 script를 실행하여 바로 RDS에 저장하고 table이 제대로 생성됬는지 확인하였다.

artist genres 테이블 생성

API를 통해 크롤링한 데이터 DB에서 분석하기

DB를 통해 간단한 분석을 실행해 볼 것이다.
먼저 artists table에서 상위 10개의 records를 보여주는 SQL구문을 통해 데이터 구성을 살펴볼 것이다. 아래 그림과 같이 출력되며, 전체 column이 id, name, followers, popularity, image_url들이 있다.

1	SELECT * FROM artists LIMIT 10;

table의 상위 10개 record

artist당 평균적으로 몇개의 장르를 갖고 있는지를 살펴보기 위해 아래와 같은 계산을 했다. 결과적으로 전체 artists는 488명을 search해서 가져왔고, artist_genres table에서는 전체 artists의 명수인 489로 나누어져 평균적인 artist 1명당 갖는 genre는 6개인 것을 확인할 수 있었다.

1
2
3

SELECT COUNT(*) FROM artists;

SELECT COUNT(*)/489 FROM artist_genres;

결과

+----------+
| COUNT(*) |
+----------+
|      488 |
+----------+
1 row in set (0.01 sec)

+--------------+
| COUNT(*)/489 |
+--------------+
|       6.4376 |
+--------------+
1 row in set (0.01 sec)

artist_genres table 에서 genre와 count를 보여주는데, genre로 GROUPING을 하고 그에 따른 갯수에 내림차순으로 보여달라고 Query문을 작성하여 살펴 보았더니, 가장 많은 genre는 rock이며, 그 다음은 classic rock이다.

SELECT genre, COUNT(*) AS count_num FROM artist_genres GROUP BY genre ORDER BY count_num DESC LIMIT 20;

# 동일한 결과를 출력
# SELECT genre, COUNT(*) FROM artist_genres GROUP BY 1 ORDER BY 2 DESC LIMIT 20;

결과

+-------------------+-----------+
| genre             | count_num |
+-------------------+-----------+
| rock              |       155 |
| classic rock      |        93 |
| album rock        |        90 |
| mellow gold       |        87 |
| folk rock         |        70 |
| soft rock         |        70 |
| art rock          |        66 |
| singer-songwriter |        61 |
| adult standards   |        51 |
| dance rock        |        50 |
| hard rock         |        50 |
| blues rock        |        49 |
| soul              |        45 |
| new wave pop      |        44 |
| roots rock        |        43 |
| permanent wave    |        40 |
| folk              |        37 |
| psychedelic rock  |        37 |
| pop rock          |        35 |
| pop               |        35 |
+-------------------+-----------+
20 rows in set (0.16 sec)

가장 popularity가 높은 가수는 아래와 같이 Drake이다. 그 다음은 Ariana Grande, Taylor Swift 순이다.

1	SELECT popularity, name FROM artists ORDER BY 1 DESC LIMIT 20;

결과

+------------+-----------------------+
| popularity | name                  |
+------------+-----------------------+
|         98 | Drake                 |
|         97 | Ariana Grande         |
|         95 | Taylor Swift          |
|         95 | Justin Bieber         |
|         95 | Kanye West            |
|         94 | Eminem                |
|         93 | Maroon 5              |
|         93 | Nicki Minaj           |
|         92 | Queen                 |
|         92 | Sam Smith             |
|         92 | Imagine Dragons       |
|         92 | Rihanna               |
|         91 | Kendrick Lamar        |
|         91 | The Beatles           |
|         91 | Lil Wayne             |
|         89 | Lana Del Rey          |
|         88 | Frank Sinatra         |
|         88 | Katy Perry            |
|         87 | Red Hot Chili Peppers |
|         87 | Snoop Dogg            |
+------------+-----------------------+
20 rows in set (0.01 sec)

이번에는 artist table에서 followers와 name 중 followers가 많은 순으로 내림차순하여 상위 20개의 records만 볼 것이다. 이 또한, Drake가 부동의 1위이다.

1	SELECT followers, name FROM artists ORDER BY 1 DESC LIMIT 20;

결과

+-----------+-----------------------+
| followers | name                  |
+-----------+-----------------------+
|  42943467 | Drake                 |
|  39573173 | Ariana Grande         |
|  35366176 | Rihanna               |
|  31737977 | Justin Bieber         |
|  31374657 | Eminem                |
|  25029508 | Taylor Swift          |
|  23906032 | Imagine Dragons       |
|  22042155 | Queen                 |
|  21244706 | Maroon 5              |
|  16648595 | Nicki Minaj           |
|  15813453 | Demi Lovato           |
|  14486822 | The Beatles           |
|  14319668 | Katy Perry            |
|  13656683 | Kendrick Lamar        |
|  12980433 | Michael Jackson       |
|  12587845 | Metallica             |
|  12205939 | Red Hot Chili Peppers |
|  11002859 | Pink Floyd            |
|  10840912 | Kanye West            |
|  10800211 | Sam Smith             |
+-----------+-----------------------+
20 rows in set (0.02 sec)

이번에는 artists table에서 popularity가 80 초과인 id를 갖는 artist들의 장르들이 무엇인지 살펴 보았다. 그 중 pop이 17건으로 제일 많았고, 그 다음은 rock이었다. 이 부분은 필자에게는 약간의 의외였다. 물론 약 500명의 artist만을 가져왔기에 객관적이지는 못하겠지만, 필자가 개인적으로 rock음악을 안들어서 인지 아직까지 rock도 인기가 있는 것으로 미루어보아 듣는 분들이 많다는 사실을 알게 되었다.

1	SELECT genre, COUNT(*) FROM artist_genres tb1 JOIN artists tb2 ON tb1.artist_id = tb2.id WHERE tb2.popularity > 80 GROUP BY 1 ORDER BY 2 DESC LIMIT 20

결과

+------------------+----------+
| genre            | COUNT(*) |
+------------------+----------+
| pop              |       17 |
| rock             |       16 |
| rap              |        9 |
| dance pop        |        9 |
| post-teen pop    |        7 |
| hip hop          |        7 |
| permanent wave   |        7 |
| album rock       |        6 |
| classic rock     |        6 |
| adult standards  |        5 |
| pop rap          |        5 |
| modern rock      |        4 |
| neo mellow       |        4 |
| hard rock        |        3 |
| alternative rock |        3 |
| g funk           |        3 |
| pop rock         |        3 |
| gangster rap     |        3 |
| post-grunge      |        3 |
| west coast rap   |        3 |
+------------------+----------+
20 rows in set (0.01 sec)