API Reference

lyrics_scraping.scrapers

lyrics_scraping.scrapers is a package that contains modules that define scrapers for specific lyrics websites, e.g. azlyrics_scraper.
scrapers.lyrics_scraper

Description
Module that defines the base class for scraping lyrics websites and saving the scraped content. More specifically, the derived classes (e.g. AZLyricsScraper) are the ones that do the actual scraping of the lyrics webpages.
By default, the scraped data is saved in a dictionary (see the variable scraped_data). The scraped data can also be saved in a database if a path to the SQLite database is given via the argument db_filepath. See the structure of the music database as defined in the music.sql schema.
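The music.sql schema itself is not reproduced in these docs. As a rough illustration only, the sketch below assumes a songs table whose columns mirror the scraped_data headers documented further down; the actual schema may differ.

    # Hedged sketch: table columns are inferred from the scraped_data
    # headers documented below, not taken from the real music.sql schema.
    import sqlite3

    conn = sqlite3.connect('music.db')
    conn.execute('''CREATE TABLE IF NOT EXISTS songs (
        song_title TEXT, artist_name TEXT, album_title TEXT,
        lyrics_url TEXT, lyrics TEXT, year TEXT)''')
    conn.execute('INSERT INTO songs VALUES (?, ?, ?, ?, ?, ?)',
                 ('Enter Sandman', 'Metallica', 'Metallica',
                  'https://www.azlyrics.com/lyrics/metallica/entersandman.html',
                  '...lyrics...', '1991'))
    conn.commit()
    conn.close()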
Class and methods

class scrapers.lyrics_scraper.Album(album_title, artist_name, album_url, year)
    Bases: object
    TODO
class scrapers.lyrics_scraper.Artist(song_title, artist_name, artist_url)
    Bases: object
    TODO
class scrapers.lyrics_scraper.Lyrics(song_title, artist_name, album_title, lyrics_url, lyrics_text, year)
    Bases: object
    TODO: remove, to be replaced by Song
class scrapers.lyrics_scraper.LyricsScraper(db_filepath='', overwrite_db=False, use_webcache=True, webcache_dirpath='~/.cache/lyric_scraping/', expire_after=25920000, use_compute_cache=True, ram_size=100, http_get_timeout=5, delay_between_requests=8, headers=pyutils.webcache.WebCache.HEADERS, seed=123456, interactive=False, delay_interactive=30, best_match=False, simulate=False, ignore_errors=False)
    Bases: object
    Base class for scraping and saving webpages locally.
    This class is responsible for doing a lot of configuration before the web scraping starts, such as setting up logging and the database. The actual scraping of the lyrics websites is done by the derived classes (e.g. AZLyricsScraper), since each lyrics website has its own way of being crawled (they are all designed differently). However, the base class is responsible for saving the scraped data in a dictionary (scraped_data) and in a database (if one was configured at initialization).
    Parameters
        lyrics_urls (list [str]) – List of URLs to lyrics webpages which will be scraped.
        db_filepath (str, optional) – File path to the SQLite music database (the default value is None, which implies that no database will be used; the scraped data will be saved only in the scraped_data dictionary).
        autocommit (bool, optional) – Whether the changes to the database are committed right away (the default is False, which implies that the changes won't take effect immediately).
        overwrite_db (bool, optional) – Whether the database will be overwritten. The user is given some time to stop the script before the database is overwritten (the default value is False).
        update_tables (bool, optional) – Whether the tables in the database can be updated (the default value is False).
        cache_dirpath (str, optional) – Path to the cache directory where webpages are saved (the default value is None, which implies that the cache will not be used).
        overwrite_webpages (bool, optional) – Whether the webpages saved in cache can be overwritten (the default value is False).
        http_get_timeout (int, optional) – Timeout when a GET request doesn't receive any response from the server. After the timeout expires, the GET request is dropped (the default value is 5 seconds).
        delay_between_requests (int, optional) – A delay added between HTTP requests in order to reduce the workload on the server (the default value is 8 seconds, which implies a delay of 8 seconds between successive HTTP requests).
        headers (dict, optional) – The information added to the HTTP GET request that a user's browser sends to a Web server, containing the details of what the browser wants and will accept back from the server (the default value is defined in saveutils.SaveWebpages.headers).
        use_logging (bool, optional) – Whether to log messages on console and file. The logging is set up according to the YAML logging file (the default value is False, which implies that no logging will be used and thus no messages will be printed on the console).
        **kwargs (dict) – TODO
    Variables
        skipped_urls (dict [str, str]) – Stores the URLs that were skipped because of an error such as OSError or HTTP404Error, along with the error message. The keys are the URLs and the values are the associated error messages.
        good_urls (set) – Stores the unique URLs that were successfully processed and saved.
        checked_urls (set) – Stores the unique URLs that were processed (whether successfully or unsuccessfully) during the current session. Thus, checked_urls should equal skipped_urls + good_urls.
        db_conn (sqlite3.Connection) – SQLite database connection.
        saver (saveutils.SaveWebpages) – For retrieving webpages and saving them in cache. See saveutils.
        valid_domains (list) – Only URLs from these domains will be processed.
        logging_filepath (str) – Path to the YAML logging file which is used to set up logging for all custom modules.
        schema_filepath (str) – Path to the music.sql schema for building the music database which will store the scraped data.
        scraped_data (dict) – The scraped data is saved as a dictionary. Its structure is based on the database's music.sql schema.
    Notes
    If the corresponding flags are activated, logging and the database are set up in __init__().
    By default, the scraped data is saved in a dictionary whose structure is described below (see scraped_data). The scraped data will also be saved if a database is given via db_filepath. See the structure of the music database as defined in the music.sql schema.
    The scraped webpages can also be cached in order to reduce the number of HTTP requests to the server (see cache_dirpath).
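    Example
    A minimal usage sketch follows. It assumes that the derived AZLyricsScraper is importable from lyrics_scraping.scrapers.azlyrics_scraper and accepts the constructor arguments shown above; both the import path and the argument handling are assumptions, not guarantees from these docs.

        # Minimal sketch; import path and constructor behavior are assumed.
        from lyrics_scraping.scrapers.azlyrics_scraper import AZLyricsScraper

        scraper = AZLyricsScraper(
            db_filepath='~/music.db',      # also mirror scraped data to SQLite
            delay_between_requests=8,      # be polite to the server
            simulate=False)
        scraper.start_scraping()           # crawl the configured lyrics URLs
        data = scraper.get_scraped_data()  # dict based on the music.sql schema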
get_lyrics_from_album(album_title, artist_name=None, max_songs=None)
    TODO
    Parameters
        album_title –
        artist_name –
        max_songs –
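    Example
    The method is still marked TODO, so the call below is only a plausible sketch based on the signature; the argument values are illustrative and the return value is undocumented.

        # Hypothetical call based solely on the signature.
        scraper.get_lyrics_from_album(
            album_title='Master of Puppets',
            artist_name='Metallica',
            max_songs=5)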
get_lyrics_from_artist(artist_name, max_songs=None, year_after=None, year_before=None)
    TODO
    Parameters
        artist_name –
        max_songs –
        year_after –
        year_before –
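    Example
    A hypothetical call based solely on the signature; that year_after and year_before bound the release years is an assumption, since the parameters are undocumented.

        scraper.get_lyrics_from_artist(
            artist_name='Metallica',
            max_songs=10,
            year_after=1990,   # assumed: only songs released after 1990
            year_before=2000)  # assumed: only songs released before 2000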
get_scraped_data()
    Return the scraped data as a dictionary.
    This method returns all the data that was scraped from the lyrics webpages. If a database was used, the scraped data is also saved in the SQLite database file found at db_filepath.
    See scraped_data for a detailed structure of the returned dictionary.
    Returns
        scraped_data – The scraped data whose content is described in scraped_data.
    Return type
        dict
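    Example
    Continuing the sketch above, the returned dictionary can be consumed by zipping each data tuple with its headers (structure per scraped_data below):

        data = scraper.get_scraped_data()
        for row in data['songs']['data']:
            # each row is a tuple aligned with data['songs']['headers']
            song = dict(zip(data['songs']['headers'], row))
            print(song['song_title'], '-', song['artist_name'])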
scraped_data = {'albums': {'data': [], 'headers': ('album_title', 'artist_name', 'year')}, 'artists': {'data': [], 'headers': ('artist_name',)}, 'songs': {'data': [], 'headers': ('song_title', 'artist_name', 'album_title', 'lyrics_url', 'lyrics', 'year')}}
    The scraped data is saved as a dictionary. Its keys and values are defined as follows:

        scraped_data = {
            'albums': {
                'headers': ('album_title', 'artist_name', 'year'),
                'data': []
            },
            'artists': {
                'headers': ('artist_name',),
                'data': []
            },
            'songs': {
                'headers': ('song_title', 'artist_name', 'album_title',
                            'lyrics_url', 'lyrics', 'year'),
                'data': []
            }
        }

    Note
    The 'data' key points to a list of tuples that will eventually store the scraped data from the different URLs, i.e. the data scraped from a given URL is added as a tuple to the list.
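    For illustration only (this is not code from the package): a derived scraper could append one scraped song as a tuple aligned with the 'songs' headers:

        scraper.scraped_data['songs']['data'].append(
            ('Enter Sandman', 'Metallica', 'Metallica',
             'https://www.azlyrics.com/lyrics/metallica/entersandman.html',
             '...lyrics text...', 1991))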
search_song_lyrics(song_title, artist_name=None)
    TODO
    Parameters
        song_title –
        artist_name –
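    Example
    The method is still marked TODO; the call below is a sketch based solely on the signature.

        scraper.search_song_lyrics(
            song_title='Enter Sandman',
            artist_name='Metallica')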
start_scraping()
    Start the web scraping of lyrics websites.
    This method iterates through each lyrics URL from the main config file and delegates the important tasks (URL processing and scraping) to separate methods (_process_url() and _scrape_webpage()).
    Notes
    This method catches all exceptions that prevent a given URL from being processed further, e.g. the webpage is not found (404 Error) or the URL is not from a valid domain.
    Any exception that is not caught here is redirected to the main script calling this method. See for example the main script scripts.scraper.
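    Example
    A sketch of a driver in the spirit of scripts.scraper: per-URL failures are recorded in skipped_urls, while anything uncaught propagates to the caller.

        try:
            scraper.start_scraping()
        except Exception as exc:
            # exceptions not caught inside start_scraping() end up here
            print(f'Scraping aborted: {exc}')
        else:
            print(f'{len(scraper.good_urls)} URL(s) scraped, '
                  f'{len(scraper.skipped_urls)} skipped')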
valid_domains = ['www.azlyrics.com']
lyrics_scraping.scripts

lyrics_scraping.scripts is a package that contains modules that define scripts, e.g. scripts.scraper.
lyrics_scraping.utils

Description
Collection of utilities specifically for the lyrics scraping project.

Functions
utils.get_data_filepath(file_type)
    Return the path to a data file used by lyrics_scraping.
    The data file can be one of the following:
    default_log: refers to the default logging configuration file used to set up the logging for all custom modules.
    default_main: refers to the default main configuration file used to set up a lyrics scraper.
    log: refers to the user-defined logging configuration file which is used to set up the logging for all custom modules.
    main: refers to the user-defined main configuration file used to set up a lyrics scraper.
    schema: refers to the SQL schema file music.sql used for creating the SQLite database which stores the scraped data.
    Parameters
        file_type (str, {'default_log', 'default_main', 'log', 'main', 'schema'}) – The type of data file for which we want the path.
    Returns
        filepath – The path to the data file.
    Return type
        str
    Raises
        AssertionError – Raised if the wrong type of data file is given to the function. Only {'default_log', 'default_main', 'log', 'main', 'schema'} are accepted for file_type.
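    Example
    Usage based on the documented behavior; the returned paths depend on the installation, so no concrete values are shown.

        from lyrics_scraping.utils import get_data_filepath

        schema_path = get_data_filepath('schema')      # path to music.sql
        log_config = get_data_filepath('default_log')  # default logging config

        try:
            get_data_filepath('bogus')  # not one of the five accepted types
        except AssertionError as exc:
            print(exc)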
utils.plural(obj, plural_end='s', singular_end='')
    Add a plural ending if a number is greater than 1 or there is more than one value in a list.
    If the number is greater than one, or more than one item is found in the list, the function returns 's' by default. Otherwise, the empty string is returned.
    Parameters
        obj (int, float or list) – The number or list that will be checked to determine whether a plural or singular ending should be returned.
        plural_end (str, optional) – The plural ending (the default value is "s", which implies that "s" will be returned when the number is greater than 1 or the list contains more than one item).
        singular_end (str, optional) – The singular ending (the default value is "", which implies that an empty string will be returned when the number is 1 or less, or the list contains a single item).
    Returns
        str – "s" if the number is greater than 1 or more than one item is found in the list, "" (empty string) otherwise.
    Return type
        str ("s" or "")
    Raises
        TypeError – TODO
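    Example
    Usage based on the documented behavior (the TypeError conditions are still marked TODO):

        from lyrics_scraping.utils import plural

        n = 3
        print(f'Scraped {n} song{plural(n)}')                 # Scraped 3 songs
        print(f'Found 1 album{plural(1)}')                    # Found 1 album
        print('match' + plural(['a', 'b'], plural_end='es'))  # matches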