## Scraping politely

A lot of projects require scraping websites. I usually write a scraper, run it, it fetches all of the data, and then fails in some final step before writing it anywhere. Then I curse a bit and try to fix my program without being sure what the responses actually looked like. Then I rerun my script, crossing my fingers that I don’t go over any rate limits.

This isn’t optimal, so I’ve finally come up with a better system for this. My requirements are:

2. …for a given time period (e.g., a day). If I rerun after that time period, download the page again.
3. Make everything human-readable. I want to be able to easily find the response for a given request and visually inspect it.
4. Basic rate limiting support.
5. Not reinvent the wheel.

So basically, if I request http://httpbin.org/anything?foo=bar I want it to save the response to a file like ./.db/cache/2021-07-31/httpbin.org_anything_foo_bar. Then I can cat the file and see the response (or delete it to “clear” the cache). However, URLs can be much longer than legal filenames (and the human-readable scheme above could cause collisions), so I’m going to compromise and store the response in file with an opaque hash for a name (e.g., ./.db/cache/2021-07-31/e23403ee51adae9260d7810e2f49f0f2098d8a25c3581440d25d20d02e00ccb9) and then have a CSV file in the directory that maps request URL -> hash. It’s not quite as user-friendly as being able to just visually examine the filename, but I can just do:

$cat ./.db/cache/2021-07-31/cache_map.csv | grep 'foo=bar' e23403ee51adae9260d7810e2f49f0f2098d8a25c3581440d25d20d02e00ccb9,http://httpbin.org/anything?foo=bar  I’m using Python, so for not reinventing the wheel, I decided to use requests-cache. The requests-cache package actually has an option to write responses to the filesystem, but I wanted some custom behavior: 1) the cache_map.csv file as described above and 2) naming cache directories by date. Thus, I implemented a custom storage layer for requests-cache to use. requests-cache represents storage as a dict: each URL is hashed and then requests-cache calls the getter or setter for that hash, depending on if it’s reading or writing. Thus, to implement custom storage, I just have to implement the dict interface to read/write to the filesystem, plus keep my cache_map.csv up-to-date: class FilesystemStorage(requests_cache.backends.BaseStorage): def __init__(self, **kwargs): # I'm using APIs that return JSON, so it's easiest to # use the built-in JSON serializer. super().__init__(serializer='json', **kwargs) # A cache a day keeps the bugs at bay. today = datetime.datetime.today().strftime('%Y-%m-%d') self._storage_dir = os.path.join('.db/cache', today) if not os.path.isdir(self._storage_dir): os.makedirs(self._storage_dir, exist_ok=True) # The map of filename hashes -> URLs. self._cache_map = os.path.join(self._storage_dir, 'cache_map.csv') # Load any existing cache. self._cache = self._LoadCacheMap() def _LoadCacheMap(self) -> Dict[str, str]: if not os.path.exists(self._cache_map): return {} # Using pandas is overkill, but are you even a data # scientist if you don't? return pd.read_csv(self._cache_map, index_col='filename')['url'].to_dict() # Dict implementation. def __getitem__(self, key: str) -> requests_cache.CachedResponse: if key not in self._cache: raise KeyError k = os.path.join(self._storage_dir, key) with open(k, mode='rb') as fh: content = fh.read() # I want to be able to get the URL from the response, # so adding it here. url = self._cache[key] return requests_cache.CachedResponse(content, url=url) def __setitem__(self, key: str, value: requests_cache.CachedResponse): # Note that key is already hashed, so we use value's # URL attribute to get the human-readable URL. k = os.path.join(self._storage_dir, key) with open(k, mode='wt') as fh: json.dump(value.json(), fh) # Update cache map self._cache[key] = value.url # Write the cache back to the file system. ( pd.Series(self._cache, name='url') .rename_axis('filename') .to_frame() .to_csv(self._cache_map) ) # I don't plan on using these, so didn't both implementing them. def __delitem__(self, key): pass def __iter__(self): pass def __len__(self) -> int: return len(self._cache)  Now I add a simple cache class to use this custom storage: class FilesystemCache(requests_cache.backends.BaseCache): """Stores a map of URL to filename.""" def __init__(self, **kwargs): super().__init__(**kwargs) storage = FilesystemStorage(**kwargs) self.redirects = storage self.responses = storage  Note that I’m using the same instance of my cache for both responses and redirects. This isn’t optimal if I were actually expecting redirects, but I’m not and my storage layer is designed to be a singleton (as implemented, multiple instances would clobber each other). Now I create a request class that uses my custom cache. import requests_cache from typing import Any, Dict from lib import custom_cache class Requester(object): def __init__(self): self._client = requests_cache.CachedSession( backend=custom_cache.FilesystemCache()) def DoRequest(self, url: str) -> Dict[str, Any]: resp = self._client.get(url, headers=_GetHeader()) body = resp.json() # The API I'm using always has a 'data' field in valid # responses, YMMV. if 'data' not in body: raise ValueError('Unexpected response: %s' % resp.text) return body  This reads auth info from environment variables: def _GetHeader() -> Dict[str, str]: return {'Authorization': 'Bearer %s' % _GetBearerToken()} def _GetBearerToken() -> str: bearer_token = os.getenv('bearer_token') if not bearer_token: raise RuntimeError('No bearer token found, try source setup.env') return bearer_token  Finally, I want to support rate limiting. I used the ratelimit package for this. ratelimit is based on the Twitter API, which rate limits on 15-minute intervals. So if I was hitting an endpoint that allowed 10 requests/minute (10*15 = 150 requests per 15 minutes) then I could write: @ratelimit.sleep_and_retry @ratelimit.limits(calls=150) def DoApiCall(self, url) -> Dict[str, Any]: return self._requester.DoRequest(url)  This will block the program’s main thread if this function is called more frequently than the allowed rate limit (which may not be what you want, check the ratelimit docs for other options). The downside of this implementation is that it still rate limits, even if you’re hitting the cache. You could get around this by checking the cache contents in Requester and then only conditionally calling DoApiCall, but this is left as an exercise for the reader 😉 ## Road to Thornmire Yesterday, Andrew and I started working on a driveway for the undeveloped parcel of thorny, swampy woodland we bought during lockdown. We rented a chainsaw at an equipment rental place, where the guy asked if we had ever used one before. We had not. He showed us how to start it: open the choke, pull the string thing (technical term) vigorously a couple of times until you hear the motor almost catch. Then close the choke, one more vigorous pull and the engine catches. Easy. We headed out to our woods a half-hour away, parked alongside the road, and determined a plan of attack. There’s an old stone wall that we wanted the driveway to run along, so we figured out the angle we needed to cut and tried to start the chainsaw. And failed. And failed. And failed. And then it smelled like gas we realized we had probably flooded the engine. And so there we were, sitting on the side of the road, Googling how to fix a flooded chainsaw engine. Then calling the rental place. Then finally driving back to the rental place, where they showed us how to get it going again and gave us updated instructions (pro tip: you shouldn’t open the choke at all if the saw is already warmed up). When we got back to our property, the chainsaw started right up and I started cutting down saplings and hacking a path through the undergrowth. For every one minute of chopping, I had to stop the chainsaw, put it down, and rest, because it was so damn heavy. (It’s also freakin loud.) After we cut a narrow path about 30′ into the underbrush (of the ~1000′ we need to cut), the chain jumped the track. Some more Googling later, and we realized we didn’t have the hexwrenches we needed to get it back on. This was all, of course, very frustrating and, in some ways, a huge waste of time. It certainly wasn’t how we were planning to spend the day. However, we learned a ton, so I’m counting it as not really a waste. We figured out a bunch of things that didn’t work and have a better idea of what to try next time (ear protection, bringing hexwrenches, rent a Brush Hog for the small stuff). My arms/shoulders/back are all noodles today, so we are going to be built by the end of this. And machete-ing through the woods is pretty satisfying. ## Optimizing resource allocation Every year, I go to GenCon: a gaming conference where tens of thousands of nerds descend on Indianapolis to try out new board games, RPGs, and other assorted nerdery. Indianapolis is no stranger to huge conferences, but GenCon stretches the city to its limits. GenCon buys thousands of hotel rooms throughout the city and then doles them out by lottery to the attendees. The hotels closest to the conference center sell out immediately, then it gradually filters out to the further away/more expensive options. A day after GenCon’s housing portal opens, every hotel room in Indianapolis is booked. Putting up with/hacking around this annoying system for a decade inspired me to create a theory of resource allocation that, while I can’t imagine is original, I’ve never heard anyone else talk about. When you control a finite resource that a lot of people want, there are three groups that you should allocate it to: • The deserving. These are the people you think will best use the resource: artists and superfans and young people who want to go so bad they will wait in line all night. These people might need subsidies (or at least reasonably-priced options) to be able to partake. However, they are expected to materially improve the quality of the conference/neighborhood/magnet school. • A random lottery. You think you know what will make a great conference/neighborhood/magnet school, but no one really knows the secret sauce. If you think of the previous group as a garden, this group is the wildflowers. • The rich. These people will subsidize the other groups/the event itself. Basically all resource allocations can be broken down into some division between these three groups, and the interesting question is “what proportion should deserving vs. lucky vs. rich be?” They all have different strength and weaknesses. The rich might not add much of anything culturally. The deserving may stultify into an old guard that prevents innovation. The randos might be useless. Right now, GenCon is 99.9% random lottery, with a handful of slots for rich people to get preferential access to housing. This means that they are missing out on a lot of the extra money they could be making from their more well-heeled patrons. They are also excluding a lot of passionate fans getting rooms in preference to someone who is mostly there to hang out in the hot tub. Covid vaccines are another interesting case. Who should get a vaccine? The breakdown we’ve gone with is the deserving (front line workers, the elderly, etc.) then random. If vaccine rollout is blocked on money, what if providing a rich person with a shot a few weeks early could fund ten shots for front-line workers? A thousand? A million? If not, is there any number where you’d let someone you didn’t like get a shot early for the sake of humanity? The problem with letting the rich pay their way in is that it feels so unfair. Lotteries feel fair. Letting in deserving people also feels fair (albeit subject to how deserving-ness is measured). In contrast, it’s infuriating when someone who already has enough gets more benefits. However, if you cut off the ways wealthy people can access a resource, they’re not going to just shrug and give up. They’ll just go outside the system: buy the most desirable hotel rooms a year in advance, send their kid to an expensive private school, or use their connections to get a vaccine. If we build ways to serve the wealthy in the system, the whole system can benefit from their resources. Systems often default to a 0% allocation for the rich, because it feels fairer. However, I think it’s not usually the optimal choice, it’s just the easiest one to make a case for. ## Intro to Altair Altair is a beautiful graphing library for Python. I’ve been using it a lot recently, but it was a real struggle to get started with. Here’s the guide I wish I’d had. I’m going to be using https://colab.research.google.com/, but this should work fine in any other interactive notebook you want to use. ## Getting started First, you’re going to want to import numpy and pandas as well as altair. They’ll make working with data easier. import altair as altimport numpy as np import pandas as pd To start with, we’ll generate a random dataframe and graph it using pandas. It’ll use matplotlib and look pretty ugly: Instead, if you use altair: Not much prettier, but it’s a start. There are several important things to note: • There are three separate parts to creating this graph: 1. Passing in the data you’re using (the alt.Chart call). 2. What kind of marks you want. There are dozens of options: dots, stacks, pies, maps, etc. Line is a nice simple one to start with. 3. What x and y should be. These should be the names of columns in your dataframe. • From point #3 above: Altair does not understand your indexes. You have to reset_index() on your dataframe before you pass it to Altair, otherwise you can’t access the index values. (The index becomes a column named “index” above.) • The API is designed to chain calls, each building up more graph configuration and returning a Chart object. The default behavior for showing a returned chart is displaying it. Using this slightly more complicated configuration, you get a more attractive graph that you can do more with. However, as you try to do more with Altair, it just feels… not quite right. And it took me a while to figure out why. ## Why Altair’s API feels weird Why doesn’t Altair let you pass in a column (instead of a column name)? Why is typing and aggregation done in strings? Why is the API so weird in general? The reason (I think) is that Altair is a thin wrapper around Vega, which is a JavaScript graphing library. Thus, if you take the code above and call to_json(), you can get the Vega config (a JSON object) for the graph: chart = alt.Chart(df.reset_index()).mark_line().encode( x='index', y='val' ) print(chart.to_json())  { "$schema": "https://vega.github.io/schema/vega-lite/v4.8.1.json",
"config": {
"view": {
"continuousHeight": 300,
"continuousWidth": 400
}
},
"data": {
"name": "data-54155f6e9cef9af445e6523406ab9d2b"
},
"datasets": {
"data-54155f6e9cef9af445e6523406ab9d2b": [
{
"index": 0,
"val": 0.772999594224295
},
{
"index": 1,
"val": 0.6175666167357753
},
{
"index": 2,
"val": 0.824746009472559
},
{
"index": 3,
"val": 0.23636915023034855
},
{
"index": 4,
"val": 0.730579649676023
},
{
"index": 5,
"val": 0.507522783979701
},
{
"index": 6,
"val": 0.6662601853327993
},
{
"index": 7,
"val": 0.39232102729533436
},
{
"index": 8,
"val": 0.9814526591403565
},
{
"index": 9,
"val": 0.6932117440802663
}
]
},
"encoding": {
"x": {
"field": "index",
"type": "quantitative"
},
"y": {
"field": "val",
"type": "quantitative"
}
},
"mark": "line"
}

The cool thing about Vega charts is that they are self-contained, so you can copy-paste that info into the online Vega chart editor and see it.

In general, I’ve found there are slightly confusing Python equivalents to everything you can do in Vega. But sometimes I’ve run into a feature that isn’t yet supported in Python and had to drop into JS.

## Lipstick on the pig

We can give everything on this chart a nice, human-readable name by passing a title to the constructor, x, and y fields:

alt.Chart(df.reset_index(), title='Spring Rainfall').mark_line().encode(
x=alt.X('index', title='Day'),
y=alt.Y('val', title='Inches of rainfall'),
)


You can also use custom colors and such, but the last graph I made someone asked why it was puke-colored, so that’s left as an exercise to the reader.

## Poking things

The real strength of Altair, I think, is how easy it is to make interactive graphs. Ready? Add .interactive().

alt.Chart(df.reset_index(), title='Spring Rainfall').mark_line().encode(
x=alt.X('index', title='Day'),
y=alt.Y('val', title='Inches of rainfall'),
).interactive()


Now your graph is zoomable and scrollable.

However, you might want to give more information. In this totally made up example, suppose we wanted to show who had collected each rainwater measurement. Let’s add that info to the dataframe, first:

rangers = (
pd.Series(['Rick', 'Scarlett', 'Boomer'])
.sample(10, replace=True)
.reset_index(drop=True))
df = df.assign(ranger=rangers)


Now we’ll add tooltips to our chart:

alt.Chart(df.reset_index(), title='Spring Rainfall').mark_line().encode(
x=alt.X('index', title='Day'),
y=alt.Y('val', title='Inches of rainfall'),
tooltip='ranger',
).interactive()


Which results in:

Pretty nifty! Give it a try yourself in a colab or the Vega editor, and let me know what you think!

## Adventures in modern web programming

At this point, I’ve fallen so far behind of where JS developers are that I don’t think I’ll ever be able to figure out what’s going on. However, Vercel is a portfolio company of GV’s, so I decided to give it a valiant effort.

Thus, I started at vercel.com. I went through their deploy flow for a Gatsby template, linked my GitHub account, and ended up with a static webpage. This created a new Gatsby repository on my GitHub account. Unfortunately, I also have no idea how to use Gatsby. However, I’ve also been meaning to learn Gatsby, so let’s dive in.

I cloned the repository and opened up in Visual Studio Code. Unfortunately, I’m not super familiar with VS Code, either, so then I had to look up how to add the damn folder to my workspace. (The weird thing about working at Google is that I have the best tools in the world at my disposal… just not the ones anyone else in the world uses.)

One quick StackOverflow search later, I’m suspiciously inspecting index.js in VS Code. This seems to be the business end of the app, but unfortunately I’m not familiar with React nor Helmet, both of which seem to be doing some lifting here.

Usually I’ve found the best way to learn a new thing is to mess around with it, so let’s start by changing the front end. I change the h1, commit, and push.

I head to the Vercel equivalent of the github page (e.g., my repo is github.com/kchodorow/gatsby, so my Vercel dashboard for it is https://vercel.com/kchodorow/gatsby. Nice. After a second, it updates and shows my new commit as the deployed version. Very nice. It also has been emailing me about its actions each step, which is a bit much for a personal project but would be nice in general.

Okay, time to get serious. How do I actually connect Vercel to a backend? Googling around for this, it looks like I’m going to be writing serverless functions. Guess what else I’m not familiar with? However, this looks interesting. Basically I can put node.js functions in files like api/foo.ts and this becomes a server request my app can make (/api/foo). I rename date.ts to hello.ts and push it out.

Vercel displays “Build failed.” Clicking on it, It gives me the build logs:

I take a look at index.js and realize that there’s some code that calls the backend function and loads it into a variable, which I completely neglected to change. Well, that’s good, just having {hello} work would be a bit too magic for my blood (and how would nested directories in /api be specified?). I update index.js and this time, cleverly, run yarn run build before pushing.

Sigh. Fine. I install yarn. Then I run yarn. It immediately fails because I needed to run npm install first. So I install dependencies, then I run yarn. Success! A push later, a successful build, and:

Verdict: Vercel is very cool. And I feel a little less behind the curve.

See the actual code behind this paragon of frontend programming at https://github.com/kchodorow/gatsby.

## Hassle-free LaTeX with Overleaf

There is something delightful about LaTeX. However, the last time I bothered with it was in college, since I don’t have much call for PDFs in day-to-day life. I recently came across Overleaf, which is an online LaTeX editor. The nice part is that it live-renders your work and you can right-click->Save as an PNG. Thus, you can suddenly embed gorgeously formatted math anywhere. For example, here’s one of my favorite proofs, that the square root of two is not a rational number:

Source code:

\documentclass[varwidth=true, border=10pt]{standalone}
\usepackage[utf8]{inputenc}
\usepackage{amsmath}

\begin{document}

Suppose $\sqrt{2}$ was rational. Then we could write:

$\sqrt{2} = \frac{a}{b}$

...where $a/b$ is in lowest terms. Squaring both sides yields:

$2 = \frac{a^{2}}{b^{2}}$

Now multiply both sides by $b^{2}$:

$2b^{2} = a^{2}$

$a^{2}$ must be even, since $b^{2}$ is multiplied by 2. For $a^{2}$ to be even, $a$ must be even, so we can say that $a = 2c$ for some $c$.

Thus, we can write this equation as:

$2b^{2} = (2c)^{2}$

or:

$2b^{2} = 4c^{2}$

Now we can divide both sides by 2... but we end up with $b^{2} = 2c^{2}$, which is shaped the same as $2b^{2} = a^{2}$ above!

We can continue expanding this equation out forever, so there are no whole numbers that $a$ and $b$ can resolve to.

Thus, $\sqrt{2}$ is irrational.

\end{document}


Gorgeous.

## Risking it all

Sorry to keep posting financial stuff, but whatever, it’s my blog.

It’s interesting how the amount of investment risk that a human can put up with is very relevant to how much they have invested, and it isn’t linear. Let’s take the case of three investors, all of whom currently can invest $1k/month and need$1M in assets to live comfortably off of assets alone. While more is more, suppose these people aren’t particularly driven to keep accumulating wealth beyond their needs ($1M). They start with: • Investor A:$1,000 in investments
• Investor B: $1,000,000 in investments • Investor C:$1,000,000,000 in investments

To simplify things, we’ll say they keep 100% of their assets in stocks. Now, let’s say the market plunges by 90%: $100 invested in the market is now worth$10. What happens to each investor?

• Investor A now has $100 in investments • Investor B now has$100,000 in investments
• Investor C now has $100,000,000 in investments I would argue that investors A & C are in a similar boat here, ironically. Investor A started out .1% of the way towards their goal and next month, they will be back to that. Not much has changed for them: the market set them back by one month. Conversely, it doesn’t really matter what happens to Investor C’s portfolio. They’re doing fine regardless: greater than 100% of the money they need is still greater than 100%, even if it’s less than before. Thus, Investor B is the only one in the danger zone. They were exactly at their investment goal, and now they’re only 1/10th of the way there! Theoretically, they’re now 7 years (900 months) away from$1M!

I was reading about “bond tents” as a way to defend against stock market crashes at retirement: you don’t want a market crash right when you retire, because then you’ll sell your stocks and have no way to replenish them to take advantage of the market recovery. (This is called sequence of returns risk, which ERN does a great job explaining.) Thus, it’s a good idea to increase your bond allocation going into your retirement so you don’t have to sell any stocks if there is a crash. Bond tents might be a good mechanism for investors like Investor B, too: if you’re near your goal you have more to lose than any other time.

## Part 4: compounding returns

As Einstein (maybe) said, compounding interest is the eighth wonder of the world. In the previous posts in this series, we used a very linear benchmark: 4% off of the amount contributed forever. However, this is a weird way to benchmark results. Imagine you and friend (call him Baelish) are both investing and comparing results. You start off by investor $1k in August, 2020 and hope to have$1040 in one year. One year later, you have succeeded and have $1040. You tell your friend Peter Baelish about your investment and he puts$1040 in the market and sets a goal of making 4%: he’ll try to have $1081.60 by next year. However, if you stick with the model used in the previous post, your goal for the year will only be$1080 by 2022, leaving you a whole $1.60 poorer than your friend. (Arguably. We are talking about the benchmark, not the actual amount of money you’re earning. However, this whole series could probably be titled “It’s easy to mislead yourself that you’re doing better than you are,” so it’s fitting the theme) So instead of a fixed amount, we want to continuous compound our benchmark. Because each value in the benchmark is determined based on the previous one, this is not Pandas-friendly. We’ll have to drop into “normal Python” to compute a more accurate benchmark. Let’s stick with SPY, making once-a-month$1k contributions with the goal of a 4% annual return. The only difference is that we now want a 4% compounding annual return.

There are 253 trading days in a year and the formula for continuously compounding returns is:

Pt = P0 * ert

That is, the amount you have at time t (Pt) is the amount you have at the beginning of the period (P0), multiplied by the constant e raised to the rate (4%) times the amount of time. For example, if we contribute $1k and wait 1 year, we’ll want to have: P1 =$1000 * e4% * 1 year = $1000 * e.04 * 1 =$1040.81

We get an extra $.81 cents off of the continuous compounding. However, we want to be able to graph this benchmark by the day, so our t isn’t 1, it’s 1 year / 253 days = .00395. Plugging this in, each day we should have e.00016 more than the previous day. Using the code from previous posts, this gives us: import math benchmark = [] idx = 0 for day in df.index: contribution = 0 if day in my_portfolio.index: contribution = my_portfolio.loc[day].total_cost_basis # The first time through, benchmark is [] (False), so it just adds # todays_benchmark. Each subsequent iteration it uses the previous row. if benchmark: today = benchmark[-1] * math.exp(.004 * (1/253)) + contribution else: today = contribution benchmark.append(today) df.assign(benchmark=benchmark)[['total', 'benchmark']].plot()  This is, honestly, extremely similar to the original chart. After all, at the end of 1 year we only expect them to have diverged by ~80 cents per$1000! However, over time this number grows, which is clearer if we subtract our cost basis from the benchmark:

(df.assign(cost_basis=my_portfolio.total_cost_basis.cumsum())
.fillna(method='ffill')
.assign(performance=lambda x: x.benchmark - x.cost_basis)
.performance
.plot())


The nice thing about investing is that your money works “while you sleep” (depending on timezone). Tightening up the benchmark line by using a compounding benchmark lets us hold our money accountable when reviewing performance.

## A multi-stock portfolio: comparing returns

The last posts have discussed portfolio performance with a very boring portfolio: one stock! Things get more interesting when there’s more than one stock to compare.

Let’s say we have a two stock portfolio now: SPY (as before) and DOCU (Docusign). We’ll combine the two tickers into one dataframe:

spy = pandas_datareader.data.DataReader('SPY', 'yahoo', start='2020-05-01', end='2020-07-31')
df = (
pd.concat([spy.assign(ticker='SPY'), docu.assign(ticker='DOCU')])
.reset_index()
.set_index(['ticker', 'Date'])
.sort_index())


Now that things are getting more complicated, I’m going to switch to graphing with Altair, which I’ve found to be a more sophisticated (and prettier) alternative to matplotlib (which is what df.plot gives you).

import altair as alt
alt.Chart(df.reset_index()).mark_line().encode(
x='Date',
y='Close',
color='ticker',
)


Both stocks are doing well and it looks like DOCU is doing a bit better. So what would happen if we had invested $1000 in each of these stocks each month? We’ll create a random portfolio in a similar way to the first post, but notice that now we have to group by ticker symbol when we want ticker-specific info. import numpy as np # Get a random price on the first of the month _for each ticker_. first_of_the_month = df[df.index.get_level_values(1).day == 1] my_portfolio = ( first_of_the_month .apply(lambda x: np.random.randint(x.Low, x.High), axis=1) .rename('share_cost') .to_frame() .assign( # Add a cost basis column to show$1k/month.
cost_basis=1000,
shares_purchased=lambda x: 1000/x.share_cost,
)
# Here's where the multiple tickers becomes interesting. We only want to
# sum over rows for the same stock, not all rows.
.assign(
cost_basis=lambda x: x.groupby('ticker').cost_basis.cumsum(),
total_shares=lambda x: x.groupby('ticker').shares_purchased.cumsum(),
)
)
my_portfolio


Add my portfolio back into the main dataframe:

df = (
df
.assign(
shares_owned=my_portfolio.total_shares,
cost_basis=my_portfolio.cost_basis,
)
.fillna(method='ffill')
.assign(total_value=lambda x: x.shares_owned * x.Close)
)
alt.Chart(df.reset_index()).mark_line().encode(
x='Date',
y='total_value',
color='ticker',
)


This makes it nicely clear that, while SPY grew our investment by a couple hundred over this period, DOCU trounced it, nearly returning $2k on the same investment. …or did it? After all, these gains aren’t realized. You don’t actually have$4917.68 in your wallet, you have shares that are worth that on the market. And to get that into your wallet, you need to pay taxes. So, a somewhat less exuberant way of looking at this is to include taxes in our calculations.

Everyone has a different tax situation, but since most people reading this are probably developers, let’s assume you’re in a pretty high tax bracket (40% income tax, 20% long-term cap gains). For the first year we hold this stock, profits will be taxed at 40% if we sell. After that, we could decrease the taxes to 20%, but we only have 3 months here, so let’s say we’re still in short-term gains territory.

df = df.assign(
# Capital gains
gain=lambda x: x.total_value - x.cost_basis,
taxes=lambda x: x.gain * .4,
realizable_value=lambda x: x.total_value - x.taxes,
)


If the gain is less than zero it’s a loss and could be used to cancel out gains, but I’m not handling that here (technically I don’t have to because there aren’t any loses, but I hear that sometimes the stock market goes down).

If we subtract taxes and our original investment and pretend we sold on the last day on the chart (July 31), we get:

(df.realizable_value - df.cost_basis).groupby('ticker').last()

ticker
DOCU 1150.608974
SPY 166.252482
dtype: float64

DOCU still does amazing (although less so), SPY’s returns are starting to look fairly ho-hum. Although less dramatic, I like this way of looking at returns since it’s more reflective of what you actually end up with. And a good incentive to let stocks marinate for a year.

## Show me the money: tracking returns

Last post went over building a very simple portfolio tracker to show a portfolio’s performance over time. However, it would be easy to trick myself: “My portfolio value is going up over time, I’m doing great!” But I’m also adding money to my portfolio over time, so that money shouldn’t “count” in terms of performance. I really want to be able to see the difference between having stashed the money in my mattress vs. put it into the market. We’ll figure out how to graph that in this post.

Last post, we ended with this chart:

Let’s say that we don’t care about cash invested, only profits and losses. To see that, we can subtract out the cost basis and see what the raw performance looks like:

df = df.assign(total_profit=df.total_value - df.cash_pos)
_ = df.total_profit.plot()


This gives a pretty nice breakdown of how I’m doing vs. keeping money in a mattress. However, how am I doing vs. my goals? Say my goal is to return at least 4%/year, but I can’t just draw a line from 0% January 1 to 4% December 31 because the money wasn’t invested in a lump sum. Money that’s been sitting there for a year should have yielded 4%, but money that was put in 1 month ago should yield 1/12 of that.

I think the easiest way to model this is to figure out how much our cost basis (cash_pos) should return per day, and then take a cumulative sum as the days pass to get the total expected return for any given day. (This isn’t perfectly accurate, but good enough for my purposes.)

benchmark_percent = .04  # 4%
trading_days_per_year = 253  # Market isn't open 365 days/year