Building a Budget News-Based Algorithmic Trader? Well then You Need Hard-To-Find...

 4 years ago
source link: https://mc.ai/building-a-budget-news-based-algorithmic-trader-well-then-you-need-hard-to-find-data - part-2/

Datasets

Datasets are a favorite way to get large amounts of data quickly. If the right data is available, a dataset can cut algorithm development time considerably, because everything can be downloaded and used at once. I searched dozens of dataset archives, from Google's dataset search service to Kaggle. Surprisingly, the only source that provided truly useful datasets was Kaggle, and it actually had several!

Pros:

Lots of Information — The more information available, the easier it is to learn and discover trends. This is one reason the classic dataset has never fallen out of style!

Quick compilation — When the dataset is downloaded, it is incredibly fast to access, train on, and use the dataset, leading to fast development times.

Cons:

Hard to update — Updating a dataset you did not create is challenging, and even if you manage it, you may end up stitching together sources that do not quite match. You may be at the mercy of the dataset creator to release a new version, or you may only ever have access to old data. Both are significant drawbacks of the classic dataset.

Hard to find — It is much easier to find a news API than a news dataset. Even if you do find a dataset, finding one that exactly matches the problem you are trying to solve is unlikely. This may make using a dataset an impossible option.
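To make the stitching problem concrete, here is a minimal sketch of merging an older dataset with newer rows from another source. The row shape (a `date` string and a `headline` string), the date formats, and the helper names `normalize` and `stitch` are all assumptions for illustration; real sources will differ in more ways than this.

```python
from datetime import date, datetime

def normalize(row, date_format):
    """Map a source-specific row onto a shared (date, headline) shape."""
    return {
        "date": datetime.strptime(row["date"], date_format).date(),
        # Collapse repeated whitespace so near-identical headlines compare equal.
        "headline": " ".join(row["headline"].split()),
    }

def stitch(old_rows, new_rows):
    """Merge two normalized sources, dropping duplicates and sorting by date."""
    seen = set()
    merged = []
    for row in old_rows + new_rows:
        key = (row["date"], row["headline"].lower())
        if key not in seen:
            seen.add(key)
            merged.append(row)
    return sorted(merged, key=lambda r: r["date"])

# Hypothetical rows: the old dataset and the newer feed use different date formats.
old = [normalize({"date": "2016-07-01", "headline": "Apple  unveils new product"}, "%Y-%m-%d")]
new = [
    normalize({"date": "07/01/2016", "headline": "Apple unveils new product"}, "%m/%d/%Y"),
    normalize({"date": "07/02/2016", "headline": "Microsoft reports earnings"}, "%m/%d/%Y"),
]
combined = stitch(old, new)
```

Normalizing both sources to one shape before merging keeps the duplicate check simple; headlines that are worded differently across sources would still slip through and need fuzzier matching.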

This dataset contains articles from Bloomberg, CNBC, Reuters, WSJ, and Fortune from January to May of 2018. The total size of the dataset is over 1 gigabyte, containing thousands upon thousands of articles and metadata.

Pros:

Ample Data — At 1 gigabyte, this is by far the largest dataset I found. Whether you want specific tickers or general news, you should have no problem extracting the information you need. It also has plenty of metadata, including which entity each article is about and the sentiment towards that entity.

Reliable Data — The dataset contains reputable sources only, providing reliable news coverage to base your algorithm on.

Cons:

Short time span — Five months of data is a small sample. The market was stable and doing well over this period, so an algorithm trained only on it may not learn how to handle other market conditions.

Messy data — The data is organized by article, while the date and associated entities are buried in the metadata. This means substantial data wrangling is likely required before the dataset is usable.

Overall:

This dataset is a great starting point for data collection. Its significant drawback is the short timespan, but if you treat it as a starting point and fill in supplemental data from May 2018 onward, this dataset could prove valuable!
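The wrangling mentioned above could look roughly like the sketch below, which regroups article-level records into one bucket per date and entity. The field names (`published`, `title`, `entities`, `name`, `sentiment`) are hypothetical; the dataset's actual schema is not described here and should be checked before adapting this.

```python
from collections import defaultdict

def flatten(articles):
    """Turn article-level records into one bucket per (date, entity) pair."""
    rows = defaultdict(list)
    for article in articles:
        day = article["published"][:10]  # keep the YYYY-MM-DD part of an ISO timestamp
        for entity in article.get("entities", []):
            rows[(day, entity["name"])].append(
                {"title": article["title"], "sentiment": entity["sentiment"]}
            )
    return dict(rows)

# Hypothetical records shaped like "article plus metadata".
articles = [
    {"published": "2018-03-14T09:30:00", "title": "Chipmaker shares climb",
     "entities": [{"name": "ExampleCorp", "sentiment": "positive"}]},
    {"published": "2018-03-14T16:05:00", "title": "Regulator opens inquiry",
     "entities": [{"name": "ExampleCorp", "sentiment": "negative"},
                  {"name": "OtherCo", "sentiment": "neutral"}]},
]
by_day_and_entity = flatten(articles)
```

Keying on the date and entity makes it straightforward to look up all news for one ticker on one day, which is the shape a trading algorithm usually needs.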

This dataset is pretty lightweight but is by far the most intriguing dataset on this list. It includes articles spanning 2006 to 2016 for Microsoft and Apple only. Each date contains the open and close prices, as well as a string of all headlines from the New York Times that dealt with said company. The dataset contains sentiment analysis on the combined headline string indicating if a positive or negative sentiment is detected.

Pros:

Long Timespan — 10 years of headlines are ample data to train, test, and validate an algorithm, and this can be even further improved by adding additional data in a similar methodology.

Supplemental Information — The built-in stock price and sentiment analysis columns make this dataset training-ready! Many additional steps, like natural language processing, are already done for you!

Reliable Data — Data comes directly from the New York Times, and while this isn’t a diverse source of data, it is a reliable and consistent source.

Cons:

Only 2 tickers — It could be dangerous to learn from 2 tickers and extrapolate to other stocks. It is a shame this dataset does not contain 20+ tickers from different sectors! Apple and Microsoft are also both successful companies, which could introduce unwanted survivorship bias.

Data is getting old — Only having data as recent as 2016 could hurt when wanting to create an algorithm to trade today. This may require a decent amount of backfilling the missing information to be usable.

Lack of Metadata — The information provided is only strings of headlines. This lacks in-depth metadata and article content that could prove useful.

Overall:

This dataset is great for learning how to build an algorithmic trader. It provides a good amount of data on 2 tickers along with extra analysis. If you want to grab a dataset and begin training, there is no better option than this one! I would be cautious about using it as your only data source, however, especially if you are looking to build a comprehensive algorithm. The older data and the limited information per row hold back what is otherwise a great dataset.
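As a rough illustration of how rows like these might be prepared for training, the sketch below derives an up/down label from the open and close prices and splits the rows by date rather than at random. The key names (`date`, `open`, `close`, `headlines`) and the sample values are invented for the example, and the 80/20 split is an arbitrary choice.

```python
def add_label(rows):
    """Attach 1 if the close is above the open for that day, else 0."""
    return [{**row, "up": int(row["close"] > row["open"])} for row in rows]

def chronological_split(rows, train_fraction=0.8):
    """Split by date order so the test set is strictly later than the training set."""
    ordered = sorted(rows, key=lambda r: r["date"])  # ISO dates sort correctly as strings
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Hypothetical rows; the prices and headlines are made up.
rows = [
    {"date": "2016-01-05", "open": 10.0, "close": 10.4, "headlines": "new product announced"},
    {"date": "2016-01-04", "open": 10.2, "close": 10.0, "headlines": "supplier delays reported"},
    {"date": "2016-01-06", "open": 10.4, "close": 10.4, "headlines": "analyst note published"},
    {"date": "2016-01-07", "open": 10.4, "close": 10.9, "headlines": "earnings beat estimates"},
    {"date": "2016-01-08", "open": 10.9, "close": 10.5, "headlines": "lawsuit filed"},
]
train, test = chronological_split(add_label(rows))
```

Splitting by date avoids training on days that come after the ones being tested, which would make backtest results look better than they should.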

This dataset contains the top 25 upvoted headlines retrieved each day from Reddit's world news forum, spanning 2008 to 2016. It also contains Dow Jones Industrial Average data along with a boolean label: 0 if the Dow closed lower that day and 1 if it closed higher.

Pros:

Long Timespan — 8+ years with 25 headlines per day is ample data to train, test, and validate an algorithm, and this can be even further improved by adding additional data in a similar methodology.

Well made — The dataset is well organized and ready to be utilized for algorithm development. The dataset was produced by a professor for use in a deep learning course, so it is naturally made easy to use.

Cons:

Data Validity — Pulling headlines based on what users upvoted and downvoted can introduce bias into the algorithm. Reddit is also not vetted for the validity of the upvoted news sources.

Data is getting old — Only having data as recent as 2016 could hurt when wanting to create an algorithm to trade today. This may require a decent amount of backfilling information to be usable.

Not specific — The data is only from world news, not financial news or individual symbols, so extracting specific financial articles is not possible.

Overall:

This is the most well-rounded dataset of the three. It provides ample data, a great timespan, and the opportunity for a user to easily add to it, augment it with techniques like NLP, or use it to get an algorithm developed quickly. This convenience and ample free data, however, come at the cost of data reliability.
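A first pass over data shaped like this might join each day's headlines into one text and check how often the majority label occurs, which gives a baseline any model has to beat. The column names `Top1`, `Top2`, ... and `Label` are assumptions for this sketch, and the sample uses two headlines per day instead of 25 to stay short.

```python
from collections import Counter

def combine_headlines(row, n=25):
    """Join the day's headline columns into one lowercase string."""
    headlines = [row.get(f"Top{i}", "") for i in range(1, n + 1)]
    return " ".join(h.strip().lower() for h in headlines if h)

def majority_baseline(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical rows with invented headlines.
rows = [
    {"Top1": "Markets rally ", "Top2": "Talks resume", "Label": 1},
    {"Top1": "Storm hits coast", "Top2": "", "Label": 0},
    {"Top1": "Election results in", "Top2": "Trade deal signed", "Label": 1},
]
texts = [combine_headlines(r, n=2) for r in rows]
baseline = majority_baseline([r["Label"] for r in rows])
```

If a trained model cannot beat the majority baseline on held-out days, the headlines are not adding predictive value.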

Overall Impression of Datasets

All of these datasets provide ample free data incredibly quickly. However, none of them is perfect. Each suffers from drawbacks that could limit its usefulness, and all contain data from 2018 or older. The benefit of datasets, however, is that they provide a great starting point for adding some historical context to your free API or web scraper!

