CS 168: The Modern Algorithmic Toolbox, Spring 2022

Announcements

  • 5/29: Greg's Monday OH will be moved to Wednesday, 3-4pm. They will be held under the tent right outside of the classroom (CEMEX). Feel free to drop by if you want to discuss your final miniproject, or anything else even vaguely related to the course.
  • 5/24: Here is mini-project 9, with images. The miniproject is due Wednesday, June 1st at 11am. This project is a bit shorter to give ample time for you to also work on the 'final miniproject'.
  • 5/18: Correction on miniproject 8, 3b: f+ should be padded out to length 2N-1, not 2N. [I just replaced the original miniproject upload with the corrected version of this.]
  • 5/16: Mini-project #8 is available. Here is the laurel_yanny.wav audio clip for Part 3, and a zipped folder worldsYikes.zip containing examples that we created illustrating that the 10-point research-oriented bonus part is not impossible : ) Also feel free to see the research paper that grew out of the work of two former CS168 students on this bonus part. The miniproject is due Wednesday, May 25th at 11am.
  • 5/9: Mini-project #7 is available. Here is the data file of national parks for Part 2. For Part 3, the QWOP scripts are available for MATLAB here and Python here. It is due Wednesday, May 18th at 11am.
  • 5/2: Mini-project #6 is available. Here is the data file for Part 2. It is due Wednesday, May 11th at 11am.
  • 4/26: I updated miniproject 5, part 1e to ask that you give 3 analogies that are difficult for this word-vector approach.
  • 4/25: Mini-project #5 is available (this is one of my favorite miniprojects....) It is due Wednesday, May 4th at 11am. The data files you will need are: dictionary.txt, co_occur.csv, analogy_task.txt, and p5_image.gif.
  • 4/18: Mini-project #4 is available. It is due Wednesday, April 27th at 11am. Here is the dataset for Part 2 and here is the decoding file for this part.
  • 4/14: Rachael kindly offered to permanently move her office hours to Tuesday 8-10pm ([Zoom link]). This is not meant to encourage you to wait until then to work on the miniproject...
  • 4/12: Mini-project #3 is now available. It is due Wednesday, April 20th at 11am.
  • 4/11: Minor correction to miniproject 2, problem 3(d). The prescribed setting for d should be d = 15 log n, not d = 30 log n.
  • 4/5: Mini-project #2 is available. The dataset is available here. It is due Wednesday, April 13th (at 11am). [Note, I may make slight modifications to the last part tomorrow.]
  • 4/1: We have (almost) finalized CA office hours, which will begin Monday 4/4.
  • 3/29: Starting with tomorrow's lecture, you'll be able to attend remotely via this zoom link with passcode 137 105. (Please attend lecture in person if you are on campus and COVID-free!!!)
  • 3/28: Mini-project #1 is available. It is due Wednesday, April 6th (at 11am). Please submit via Gradescope, entry code JBRB8N. Here is some starter code for drawing histograms in Python (and/or check Stack Overflow); a rough stand-in sketch also appears right after this list.
  • 3/28: A lecture video for the technical portion of today's lecture is posted on Canvas under the Panopto Course Videos tab. Lecture notes and links to the original paper on consistent hashing, and some other relevant links, are provided in the 'Lecture 1' portion of the 'Proposed Schedule' below.
  • First lecture Monday, March 28th, 1:30-3pm in CEMEX. See you there! Lecture videos will be available via Canvas, though I strongly encourage you to attend lecture in person if possible---it's more fun for all of us. To ensure that the lecture videos are of reasonable quality, I might upload the corresponding lecture from last year's remote offering. Please refer to this course website (rather than lecture videos) for all administrative information.
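
For those getting started on the plotting portions of the mini-projects, here is a rough stand-in for the histogram starter code mentioned in the 3/28 announcement. It is only an illustrative sketch (the Gaussian sample data and the choice of 50 bins are made up for the example), not the course's own starter file:

    import numpy as np
    import matplotlib.pyplot as plt

    # Illustrative data only: 10,000 draws from a standard normal distribution.
    samples = np.random.randn(10_000)

    # Plot a histogram with 50 bins (an arbitrary choice; pick whatever granularity you need).
    plt.hist(samples, bins=50, edgecolor='black')
    plt.xlabel('value')
    plt.ylabel('count')
    plt.title('Histogram of 10,000 standard normal samples')
    plt.show()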

Instructor:

  • Gregory Valiant (Office hours: Mon 3:15-4:15pm outside in the shaded treewell with whiteboards in the engineering quad). Contact: email me at my last name at stanford.edu or via our course Ed page.

Course Assistants:

  • Meredith Xu (Office hours: Monday 9am-11am, virtual [Zoom link] )
  • Emily Wen (Office hours: Monday 10am-noon, Huang Basement)
  • Avi Goyal (Office hours: Tuesday 9:30am-11am, Huang Basement)
  • Rachael Wang (Office hours: Tuesday 8-10pm, virtual [Zoom link] )
  • Ray Li (Office hours: Thursday 10am-noon, Huang Basement)
  • Tianyi Hao (Office hours: Thursday 3pm-5pm, location Huang 019)
  • Amber Yang (Office hours: Friday 8am-10am, virtual [Zoom link] )
  • Jiazheng Zhao (Office hours: Friday 1:30pm-3:30pm, virtual [Zoom link] )

Lecture Time/location: Mon/Wed, 1:30-3pm in CEMEX

Ed site for online discussion/questions

Prerequisites: CS107 and CS161, or permission from the instructor.

Course Description

This course will provide a rigorous and hands-on introduction to the central ideas and algorithms that constitute the core of the modern algorithms toolkit. Emphasis will be on understanding the high-level theoretical intuitions and principles underlying the algorithms we discuss, as well as developing a concrete understanding of when and how to implement and apply the algorithms. The course will be structured as a sequence of one-week investigations; each week will introduce one algorithmic idea, and discuss the motivation, theoretical underpinning, and practical applications of that algorithmic idea. Each topic will be accompanied by a mini-project in which students will be guided through a practical application of the ideas of the week. Topics include modern techniques in hashing, dimension reduction, linear and convex programming, gradient descent and regression, sampling and estimation, compressive sensing, and linear-algebraic techniques (principal components analysis, singular value decomposition, spectral techniques).

Proposed Lecture Schedule

  • Week 1: Modern Hashing

    • Lecture 1 (Mon 3/28): Course introduction. "Consistent" hashing.

    • Lecture 2 (Wed 3/30): Property-preserving lossy compression. From majority elements to approximate heavy hitters. From Bloom filters to the count-min sketch.

  • Week 2: Data with Distances (Similarity Search, Nearest Neighbor, Dimension Reduction, LSH)

    • Lecture 3 (Mon 4/4): Similarity Search. (Dis)similarity metrics: Jaccard, Euclidean, Lp. Efficient algorithms for finding similar elements in small/medium (i.e., <20) dimensions using k-d trees.

  • Week 3: Generalization and Regularization

    • Lecture 5 (Mon 4/11): Generalization (or, how much data is enough?). Learning an unknown function from samples from an unknown distribution. Training error vs. test error. PAC guarantees for linear classifiers. Empirical risk minimization.
    • Lecture 6 (Wed 4/13): Regularization. The polynomial embedding and random projection, L2 regularization, and L1 regularization as a computationally tractable surrogate for L0 regularization. Supplementary material:
      • A relatively recent paper arguing that, to understand why deep learning works, we need to rethink the theory of generalization. This paper was quite controversial, with one camp thinking that its conclusions are completely obvious, and the other camp thinking that it is revealing an extremely deep mystery. You decide for yourself! Paper is here.
  • Week 4: Linear-Algebraic Techniques: Understanding Principal Components Analysis

    • Lecture 7 (Mon 4/18): Understanding Principal Component Analysis (PCA). Minimizing squared distances equals maximizing variance. Use cases for data visualization and data compression. Failure modes for PCA. Supplementary material:
      • A nice exposition by 23andMe of the fact that the top 2 principal components of genetic SNP data of Europeans essentially recover the geography of Europe: nice exposition w. figures. Original Nature paper: Genes mirror geography in Europe, Nature, Aug. 2008.
      • Eigenfaces (see also this blog post)
      • There are tons of PCA tutorials floating around the Web (some good, some not so good), which you are also permitted to refer to.
    • Lecture 8 (Wed 4/20): How PCA works. Maximizing variance as finding the "direction of maximum stretch" of the covariance matrix. The simple geometry of "diagonals in disguise." The power iteration algorithm (a short illustrative sketch appears just after this schedule).
  • Week 5: Linear-Algebraic Techniques: Understanding the Singular Value Decomposition

    • Lecture 9 (Mon 4/25): Low-rank matrix approximations. The singular value decomposition (SVD), applications to matrix compression, de-noising, and matrix completion (i.e. recovering missing entries).
    • Lecture 10 (Wed 4/27): Tensor methods. Differences between matrices and tensors, the uniqueness of low-rank tensor factorizations, and Jenrich's algorithm. Supplementary material:
      • A blog post discussing Spearman's original experiment and the motivation for tensor methods: here.
      • See chapter 3 of Ankur Moitra's course notes here for a more technical, in-depth discussion of tensor methods and Jenrich's algorithm.
  • Week 6: Spectral Graph Theory

    • Lecture 11 (Mon 5/2): Graphs as matrices and the Laplacian of a graph. Interpretations of the largest and smallest eigenvectors/eigenvalues of the Laplacian. Spectral embeddings, and an overview of applications (e.g., graph coloring, spectral clustering). Supplementary material:
      • Dan Spielman's excellent lecture notes for his semester-long course on Spectral Graph Theory. The notes include a number of helpful plots.

      • Amin Saberi offered a grad course a few years ago covering some of the research frontier of Spectral Graph Theory.
    • Lecture 12 (Wed 5/4): Spectral techniques, Part 2. Interpretations of the second eigenvalue (via conductance and isoperimetric number), and connections with the speed at which random walks/diffusions converge.

  • Week 7: Sampling and Estimation

    • Lecture 13 (Mon 5/9): Reservoir sampling (how to select a random sample from a data stream; a short illustrative sketch appears just after this schedule). Basic probability tools: Markov's inequality and Chebyshev's inequality. Importance sampling (how to make inferences about one distribution based on samples from a different distribution). Description of the Good-Turing estimate of the missing/unseen mass. Supplementary material:
      • A nice description of the probabilistic tools/approach that went into Nate Silver's Senate election forecasting model (from 2014): here.
      • A paper of mine showing that one can accurately estimate the structure of the unobserved portion of a distribution---not just the total probability mass. paper link here.
    • Lecture 14 (Wed 5/11): Markov Chains, stationary distributions. Markov Chain Monte Carlo (MCMC) as an approach to solving hard problems by sampling from carefully crafted distributions. Supplementary material:
      • A basic description of MCMC, here.
      • Lecture notes from Persi Diaconis on MCMC, including a description of the MCMC approach to decoding substitution ciphers, here.
      • Example of MCMC used for fitting extremely complex biological models: The human splicing code... Science, Jan. 2015.
      • For those interested in computer Go: here is the Jan, 2016 Nature paper from Google's DeepMind group.
  • Week 8: The Fourier Perspective (and other bases)

    • Lecture 15 (Mon 5/16): Fourier methods, part 1. Supplementary material:
      • A very basic intro to Fourier transformations with some nice visualizations: here.
      • A book version of a Stanford course on the Fourier Transform, which has many extremely nice applications: pdf here.
    • Lecture 16 (Wed 5/18): Fourier methods, part 2 (emphasis on convolutions). A short FFT-convolution sketch appears just after this schedule.
      • Lecture notes combined with Lecture 15.
  • Week 9: Sparse Vector/Matrix Recovery (Compressive Sensing) and Mathematical Programming

    • Lecture 18 (Wed 5/25): Linear and convex programming. Matrix completion.
  • Week 10: Privacy Preserving Computation

    • Lecture 19 (Wed 6/1): Differential Privacy in data analysis and machine learning.
      • We won't have the usual lecture notes for this lecture, though the first chapter of the Dwork/Roth book linked below is a fantastic starting place for learning about differential privacy.
      Supplementary material:
      • A fairly recent book on Differential Privacy by Cynthia Dwork and Aaron Roth: here (should be downloadable from the Stanford network).
      • The original paper of Dwork, McSherry, Nissim, and Smith, introducing differential privacy.
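
To make a couple of the lectures above more concrete, here are a few short Python sketches. First, the power iteration algorithm from Lecture 8 (Week 4). This is a minimal sketch, assuming the data is a mean-centered NumPy array X with one data point per row; it is an illustration of the idea, not the implementation from the lecture notes:

    import numpy as np

    def top_principal_component(X, num_iters=100, seed=0):
        """Approximate the top principal component of mean-centered data X
        (one data point per row) via power iteration on the covariance matrix."""
        rng = np.random.default_rng(seed)
        cov = X.T @ X / X.shape[0]            # d x d covariance matrix
        v = rng.standard_normal(X.shape[1])   # random starting direction
        v /= np.linalg.norm(v)
        for _ in range(num_iters):
            v = cov @ v                       # "stretch" v by the covariance matrix
            v /= np.linalg.norm(v)            # renormalize to unit length
        return v

Each iteration multiplies by the covariance matrix and renormalizes, so the iterate converges to the "direction of maximum stretch" (the top eigenvector) whenever the top eigenvalue is strictly larger than the second.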
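
Second, reservoir sampling from Lecture 13 (Week 7). This sketch keeps a uniform random sample of k items from a stream of unknown length using O(k) memory; the function and variable names are mine, chosen for illustration:

    import random

    def reservoir_sample(stream, k, seed=None):
        """Return k items chosen uniformly at random from an iterable of unknown length."""
        rng = random.Random(seed)
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)   # fill the reservoir with the first k items
            else:
                j = rng.randint(0, i)    # uniform index in {0, ..., i}
                if j < k:
                    reservoir[j] = item  # item i replaces a slot with probability k/(i+1)
        return reservoir

A short induction shows that after i+1 items have been seen, each of them sits in the reservoir with probability exactly k/(i+1).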
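
Third, the convolution theorem emphasized in Lecture 16 (Week 8), which is also the reason for the 5/18 correction about padding to length 2N-1: zero-padding makes the circular convolution computed via the FFT agree with the linear convolution. A minimal NumPy sketch (the example signals are made up):

    import numpy as np

    def linear_convolution_via_fft(f, g):
        """Convolve two real signals using the FFT, zero-padding to the full output
        length (2N-1 for two length-N signals) so circular convolution matches
        linear convolution."""
        L = len(f) + len(g) - 1
        F = np.fft.rfft(f, n=L)            # FFT with zero-padding to length L
        G = np.fft.rfft(g, n=L)
        return np.fft.irfft(F * G, n=L)    # inverse FFT of the pointwise product

    # Sanity check against NumPy's direct convolution on tiny made-up signals.
    f = np.array([1.0, 2.0, 3.0])
    g = np.array([0.0, 1.0, 0.5])
    assert np.allclose(linear_convolution_via_fft(f, g), np.convolve(f, g))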

Coursework

    • Assignments (90% of Grade): There will be 9 weekly mini-projects centered around the topics covered that week. Each mini-project contains both written and programming parts. You are strongly encouraged to work in teams of up to four students. If you work in a team, only one member should submit all of the relevant files.

      For the written part, you are encouraged to use LaTeX to typeset your homeworks; we've provided a template for your convenience. We will be using the GradeScope online submission system. Please create an account on Gradescope using your Stanford ID and join CS168 using entry code JBRB8N.

      For the programming part, you are encouraged to use MATLAB (tutorial), NumPy and Pyplot in Python (Python tutorial, Numpy tutorial, Matplotlib tutorial), or some other scientific computing tool (with plotting). Here is a comprehensive Python tutorial using Google's Colab that was put together for the CS231n course, with examples of plotting using matplotlib at the very end.

      Assignments are released on Mondays, and are due at 11am on Wednesdays the following week (both the written and the programming parts). No late assignments will be accepted, but we will drop your lowest assignment grade when calculating your final grade.

    • Final Project (10% of Grade): You can turn the project in at any point up until the last lecture of the quarter, and can work in a team of at most four students. This project should take one of two forms:
      • Option 1) Design an alternate miniproject for one topic. Choose one of the weeks in the course, and think about the high-level take-home messages from that week. Craft a miniproject that explores some of those take-home messages. The miniproject you design should have the same sort of "feel" and length as the other miniprojects for the course. If your miniproject involves a dataset, please include a clear link to that dataset (Google Drive/Dropbox/etc. is fine) and a description of what that dataset is. Please also include a sample solution for the miniproject, together with a short (i.e., 1-2 paragraph) discussion/reflection on how the miniproject complements and/or reinforces the material for that week.
      • Option 2) Dive deeper into one of the topics we've covered. Choose a week or even a single lecture, and explore how these ideas have been extended beyond the material covered in lecture. What are some of the newer applications of these ideas? How have the ideas been optimized/tweaked? Where is the current research frontier in this direction? See where your curiosity takes you, and then in a short (3-10 page) paper, present/discuss what you learned. Your paper can resemble a survey/research paper, or lecture notes.
    • Contributions to the 168 Community (Discretionary Grade Bumps): You are encouraged to help your fellow classmates when possible. This includes responding to Ed questions when you know the answer, helping improve the course lecture notes by pointing out typos/errors/confusing parts, etc. [there will be a special Ed folder specifically for providing this feedback on the lecture notes]. At the end of the course, I will bump up grades of those students who had the most positive impact on the 168 community, according to the (quite subjective) judgement of the TAs and me.

Collaboration Policy

    You can discuss the problems at a high level with other groups and contact the course staff (via Ed or office hours) for additional help. And of course, you are encouraged to help respond to Ed questions.

    You may refer to the course notes and research papers linked from the course webpage, and can also use other references that you find online. You may *NOT* consult solution sets or code from past offerings of this course. Services such as CourseHero are strictly forbidden, and if you try to sell my lecture notes or your solution sets to these sites, a well-paid lawyer representing Stanford will be in contact with you : ) [Please don't do this!] Of course, you are expected to understand everything that is written on any assignment turned in with your name; if you do refer to references, make sure you cite them appropriately, and make sure that all your words are your own.

    You are also permitted to use general resources for whatever programming language you choose to use. None of the assignments require you to write much code, and *you may NOT reuse any code from friends/internet sources related to previous offerings of this course*. If you use helper functions or code snippets that you did not write yourself (aside from standard python packages, etc.) you must clearly cite the source in your writeup.

    Please follow the honor code.

