05-12-2022 , 04:34 PM
hi, these are quite a bit tougher than the average question posted here, and the 2nd one is conceptual curiosity on my part:

1) For portfolio theory: say the S&P 500 didn't include Apple for some reason, and then you add it to the index at a 10% weight. Before AAPL was added, we had 3 vital portfolio theory stats: 1) the correlation between AAPL and SPY (which didn't include AAPL), 2) AAPL's volatility, 3) SPY's volatility. After adding AAPL, we have a new SPY volatility and an unchanged AAPL volatility, but what is AAPL's correlation to the NEW SPY index? I've never seen this calculation. Two old real-world examples: 1) Royal Dutch Shell dominating the Netherlands market, and 2) Nortel/BCE (which owned a piece of NT) dominating the Canadian index. So both countries had with/without/constrained indices.

2) I'm trying to grasp slope vs. R-squared when the x and y variables have the same variance. Say I have a binary variable: does the stock market go up that month? And I find the slope and correlation are both 0.2 (very respectable), with an intercept of 0.6. So if the stock market was down this month, there's a 60% chance it's up next month; if it's up, there's an 80% chance it's up next month. BUT I was taught that R^2 = the amount explained by the regression, so whether the market is up/down this month explains only 4% of whether it's up/down next month. That doesn't seem conceptually right to me.
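For what it's worth, the numbers above can be sanity-checked with a quick simulation: a two-state Markov chain with P(up | down) = 0.6 and P(up | up) = 0.8 (the sample size here is arbitrary):

```python
import numpy as np

# Simulate "does the market go up this month?" as a two-state Markov chain
# with the transition probabilities from the question.
rng = np.random.default_rng(0)
n = 200_000
x = np.empty(n, dtype=int)
x[0] = 1
for t in range(1, n):
    p_up = 0.8 if x[t - 1] == 1 else 0.6
    x[t] = rng.random() < p_up

this_month, next_month = x[:-1], x[1:]

# OLS regression of next month on this month, plus their correlation.
slope, intercept = np.polyfit(this_month, next_month, 1)
corr = np.corrcoef(this_month, next_month)[0, 1]

print(round(slope, 3), round(intercept, 3), round(corr, 3))
# slope ≈ 0.2, intercept ≈ 0.6, and corr ≈ slope because x and y have
# (nearly) the same variance here, so R^2 ≈ 0.04
```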

05-12-2022 , 06:09 PM
You've asked a variation of (1) a couple of times before and I tried to link you this thread:

https://quant.stackexchange.com/ques...ined-portfolio

I'm on my phone so can't type very well, but you can get the 2x2 variance-covariance matrix from:

"before AAPL was added, we had 3 vital portfolio theory stats 1) correlation between AAPL and SPY (that didn't include AAPL), 2) AAPL volatility, 3) SPY volatility"

Let's call these:

a - volatility SPY
b - volatility AAPL
c - correlation between SPY and AAPL

If you are measuring volatility in terms of standard deviations:

Cov(a, b) = [ [ a^2, abc], [abc, b^2] ]

If you are measuring volatility in terms of variance:

Cov(a, b) = [ [ a, sqrt(ab)*c], [sqrt(ab)*c, b] ]

Quote:
You can obtain the covariance between 2 portfolios by multiplying the row vector, containing the weights of portfolio A with the variance-covariance matrix of the assets and then multiplying with the column vector, containing the weights of assets in portfolio B.
Cov = [0.9, 0.1] * Cov(a, b) * [0, 1]^T

Var = [0.9, 0.1] * Cov(a, b) * [0.9, 0.1]^T

Corr = Cov / (sqrt(Var) * b)

(or Corr = Cov / sqrt(Var*b) if b is variance)
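To make the recipe concrete, here's a short numpy sketch with made-up numbers for a, b and c (volatilities as standard deviations; the 0.9/0.1 weights match the 10% AAPL example above):

```python
import numpy as np

# Made-up inputs: a, b are standard deviations, c is the correlation.
a, b, c = 0.15, 0.30, 0.6          # old-SPY vol, AAPL vol, correlation
w_new = np.array([0.9, 0.1])       # weights of the new index
w_aapl = np.array([0.0, 1.0])      # "portfolio" that is 100% AAPL

# 2x2 variance-covariance matrix of (old SPY, AAPL).
sigma = np.array([[a * a, a * b * c],
                  [a * b * c, b * b]])

cov = w_new @ sigma @ w_aapl       # Cov(new index, AAPL)
var = w_new @ sigma @ w_new        # Var(new index)
corr = cov / (np.sqrt(var) * b)    # Corr(new index, AAPL)

print(round(corr, 4))
```

Note the resulting correlation is higher than the original c, which is what you'd expect once AAPL is part of the index it's being correlated against.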

Juk
05-12-2022 , 06:15 PM
I'm not 100% sure what you are asking in (2), but I might be able to help if you explain a bit more?

Juk
05-12-2022 , 07:27 PM
Not sure this really answers (2) but this might help visualize what's going on:

Null model:

y = c

The value of c which minimizes the sum of squared errors is simply the mean.

Linear regression model:

y = ax + c

The values of the a and c coefficients which minimize the sum of squared errors are the linear regression solution.

So now imagine you take your original dataset and make two new datasets by subtracting off the outputs of the respective models above.

(I'd recommend you actually try plotting these two datasets to get an idea of what they look like and how they are centered around the mean. Quite often you can see that your model specification needs to be changed and/or your independent variables transformed simply by looking at the residual plots!)

What R^2 is telling you then is how much the variance is reduced between these two datasets.

This works whatever your model is, eg:

y = ax^2 + bx + c

y = a*log(x) + c

y = a*I(x<k) + c

and so on.
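As a quick illustration of the residual-variance idea (made-up data, simple linear model):

```python
import numpy as np

# R^2 as the reduction in residual variance between the null model
# (predict the mean) and the fitted model.
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(size=500)   # true slope 0.5, plus noise

# Null model: y = c, where c = mean(y) minimizes squared error.
resid_null = y - y.mean()

# Fitted model: y = a*x + c via least squares.
a, c = np.polyfit(x, y, 1)
resid_fit = y - (a * x + c)

r2 = 1 - resid_fit.var() / resid_null.var()
print(round(r2, 3))   # matches np.corrcoef(x, y)[0, 1] ** 2 here
```

For simple linear regression with an intercept, this variance-reduction R^2 equals the squared correlation exactly, which ties back to the 0.2 vs. 0.04 question.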

-----------

Quote:
say I have a binary variable, does the stock market go up that month
I should add that the above R^2 interpretation is only valid if you are minimizing the sum of squared errors (which you can do with a binary dependent variable, but it's usually not advisable).

If your target is binary then you'd usually use something like logistic/probit regression, and there isn't an actual R^2 measurement for this. There are several "pseudo-R^2" measurements, with McFadden's being the most well known. It still uses the difference between the null model and the fitted model, but instead of variance explained it outputs a value between 0 and 1: 0 means the fitted model's log-loss is no better than that of a model always outputting the mean, and 1 corresponds to the log-loss of a perfect model.
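A minimal sketch of McFadden's pseudo-R^2 on made-up binary data (with a single binary predictor the logistic MLE just fits the two conditional means, so no fitting library is needed):

```python
import numpy as np

def log_loss(y, p):
    """Mean negative log-likelihood of predicted probabilities p for binary y."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Made-up data: binary predictor x, binary target y.
rng = np.random.default_rng(2)
x = rng.integers(0, 2, size=2000)
y = (rng.random(2000) < np.where(x == 1, 0.8, 0.6)).astype(int)

# Null model: always predict the base rate.
ll_null = log_loss(y, np.full_like(y, y.mean(), dtype=float))

# Fitted model: with one binary predictor, the logistic-regression MLE
# predicts the conditional mean of y within each x group.
p_fit = np.where(x == 1, y[x == 1].mean(), y[x == 0].mean())
ll_fit = log_loss(y, p_fit)

mcfadden_r2 = 1 - ll_fit / ll_null
print(round(mcfadden_r2, 4))
```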

Juk

Last edited by jukofyork; 05-12-2022 at 07:35 PM.
05-12-2022 , 09:11 PM
thanks... wasn't sure if I'd asked it here before, and thought if 2 years had passed then it might not matter.

but I appreciate the response very much, and apologize if I made you do the same thing (and it does look like quite a bit of work) twice.

on the second one, maybe let's not use binary. but I mostly do time series, so it may be different from cross-section in terms of visualizing it.

I'll put it very basically but not mathematically at all.

it shocks me that I can have a correlation of 30% (very impressive for the work I do), and then if you square it, you get 9%...

(if I use binary variables where correlation and slope are very close, it's really easy to visualize. but you say I should use other models; I will look into that. I thought I was doing a "rough" probit/logit and had just discovered probit/logit in a very crude manner.)

of course, if you could get a variable that forecasts the stock market return on a monthly basis with a correlation of 30%, your returns would be phenomenal. (as an aside, any market timing stuff I do that has 30-40% correlation in terms of predicting the market would 1) use annual data (so a small sample size), 2) benefit from look-ahead bias no matter what you do, and 3) assume the future is like the past, which it may not be.)

thanks again
05-13-2022 , 06:00 AM
I might have misunderstood what you asked but to be clear: It's fine to use binary variables as the inputs to linear regression. It just doesn't make much sense to use linear regression where the output is binary for a couple of reasons:

- Your targets are constrained to {0, 1}, yet the output can be anything in (-inf, +inf).

- The loss function penalizes being "too right" (e.g. when your target is 0 yet your output is negative, or when your target is 1 and your output is greater than 1).
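A small made-up example of the first point, fitting a straight line to a binary target:

```python
import numpy as np

# A linear fit to a binary target happily produces "probabilities"
# outside [0, 1] for extreme inputs.
rng = np.random.default_rng(3)
x = rng.normal(size=1000)
y = (rng.random(1000) < 1 / (1 + np.exp(-3 * x))).astype(int)  # true logistic

a, c = np.polyfit(x, y, 1)          # least-squares line through binary y
preds = a * np.array([-3.0, 0.0, 3.0]) + c
print(preds)   # the predictions at x = -3 and x = 3 fall outside [0, 1]
```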

The other part of what I was trying to get at was that the concept of "slope" is unrelated to R^2 and correlation (e.g. decision trees break the range up into piecewise-constant segments, etc).

Quote:
it shocks me that I can have a correlation of 30% (very impressive for the work I do), and then if you square it, you get 9%...
That's because they are being measured on different scales. You can only really compare correlation with standard deviation, OR covariance with variance.

Juk

Last edited by jukofyork; 05-13-2022 at 06:07 AM.
05-13-2022 , 02:45 PM
here's where I think I, and especially people who know little about probability, get confused:

say the market goes up 70% of months, and that is very stable... but you don't know which months, and there may be complex patterns within the data (bad months may cluster, for example).

I think people with no real knowledge think that if they predict the market to go up each month, they have explained 70% of the market's variation.

but to explain any variation in predicting winning months, you have to choose more or less than 70% of months as winning months and be correct more than 70% of the time... sorry, that's confusing. put another way: you need a success rate greater than 70% to explain any of the variance in binary up/down. (sorry, I know I should use logit/probit, but I don't know them yet, and I find binary regression makes things much simpler.)

another way to put it, I think, is that the 70% rate of up months is basically the intercept (the null model) in this case.
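This is easy to check: an "always predict up" style forecast is just a constant, i.e. the null model, and a constant explains exactly zero variance no matter how often it's right (made-up data):

```python
import numpy as np

# Market is up 70% of months with no exploitable pattern.
rng = np.random.default_rng(4)
y = (rng.random(5000) < 0.7).astype(int)

# Constant forecast at the base rate (equivalently, always predicting "up"
# scaled to a probability): this IS the null model.
pred = np.full_like(y, y.mean(), dtype=float)
r2 = 1 - np.var(y - pred) / np.var(y - y.mean())
print(round(r2, 6))   # 0.0: the base-rate forecast explains nothing
```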

m