emoen/Machine-Learning-for-Asset-Managers: Implementation of code snippets, exer ...


Open-source project name:

emoen/Machine-Learning-for-Asset-Managers

Open-source project URL:

https://github.com/emoen/Machine-Learning-for-Asset-Managers

Programming language:

Python 100.0%

Project introduction:

Install Library

Install with pip: pip install -U git+https://github.com/emoen/Machine-Learning-for-Asset-Managers

>>> from Machine_Learning_for_Asset_Managers import ch2_fitKDE_find_best_bandwidth as c
>>> import numpy as np
>>> c.findOptimalBWidth(np.asarray([21,3]))
{'bandwidth': 10.0}

Machine-Learning-for-Asset-Managers

Implementation of code snippets and exercises from Machine Learning for Asset Managers (Elements in Quantitative Finance) written by Prof. Marcos López de Prado.

The project is for my own learning. If you want to use the concepts from the book, you should head over to Hudson & Thames. They have implemented these concepts and many more in mlfinlab. Edit: it seems that some of their work, such as the Jupyter notebooks, has gone behind a paywall.

For practical application see the repository: Machine-Learning-for-Asset-Managers-Oslo-Bors.

Note: In chapter 4 there is a bug in the implementation of the "Optimal Number of Clusters" (ONC) algorithm in the book. The code from the paper, DETECTION OF FALSE INVESTMENT STRATEGIES USING UNSUPERVISED LEARNING METHODS, de Prado and Lewis (2018), is different but is also incorrect. See https://quant.stackexchange.com/questions/60486/bug-found-in-optimal-number-of-clusters-algorithm-from-de-prado-and-lewis-201

The divide-and-conquer method of subspaces used by ONC can be problematic: if you embed a subspace into a space with a large eigenvalue, the larger space can distort the clusters found in the subspace. ONC does precisely that: it embeds subspaces into the space consisting of the largest eigenvalues found in the correlation matrix. An outline describing the problem more rigorously can be found here: https://math.stackexchange.com/questions/4013808/metric-on-clustering-of-correlation-matrix-using-silhouette-score/4050616#4050616

Other clustering algorithms, such as hierarchical clustering, should be investigated.

Chapter 2 Denoising and Detoning

The Marcenko-Pastur theoretical probability density function compared with the empirical density function:


Figure 2.1: Marcenko-Pastur theoretical probability density function and empirical density function.
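
As a rough illustration of the theoretical curve in Figure 2.1, the Marcenko-Pastur density can be evaluated directly from its closed form. This is a minimal sketch and not the repository's code: the function name `mp_pdf` and the parameter values are assumptions, `var` is the variance of the underlying random entries, and `q = T/N` is the ratio of observations to variables (assumed greater than 1 here).

```python
import numpy as np

def mp_pdf(var, q, pts=1000):
    """Marcenko-Pastur density for variance `var` and ratio q = T/N (q > 1 assumed)."""
    e_min = var * (1 - np.sqrt(1.0 / q)) ** 2   # smallest eigenvalue expected from noise
    e_max = var * (1 + np.sqrt(1.0 / q)) ** 2   # largest eigenvalue expected from noise
    e_val = np.linspace(e_min, e_max, pts)
    pdf = q / (2 * np.pi * var * e_val) * np.sqrt((e_max - e_val) * (e_val - e_min))
    return e_val, pdf

e_val, pdf = mp_pdf(var=1.0, q=10.0)
# Trapezoidal integration: the density should integrate to roughly 1.
area = float(np.sum((pdf[1:] + pdf[:-1]) / 2 * np.diff(e_val)))
print(round(area, 3))
```

Eigenvalues of an empirical correlation matrix that fall above `e_max` are the ones that cannot be explained by noise alone.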

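Denoising a random matrix with signal using the constant residual eigenvalue method. This is done by replacing the eigenvalues associated with noise by a constant (their average), which preserves the trace of the correlation matrix. See code snippet 2.5.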


Figure 2.2: A comparison of eigenvalues before and after applying the residual eigenvalue method:
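
The constant residual eigenvalue method can be sketched as follows. This is an illustration under assumptions, not the book's snippet 2.5 verbatim: the name `denoise_constant_residual` is hypothetical, and the number of signal factors `n_facts` is passed in by hand, whereas it would normally be derived from the fitted Marcenko-Pastur upper bound.

```python
import numpy as np

def denoise_constant_residual(corr, n_facts):
    """Replace all eigenvalues after the first `n_facts` by their average."""
    e_val, e_vec = np.linalg.eigh(corr)
    order = np.argsort(e_val)[::-1]            # sort eigenvalues in descending order
    e_val, e_vec = e_val[order].copy(), e_vec[:, order]
    e_val[n_facts:] = e_val[n_facts:].mean()   # constant value for the noise eigenvalues
    corr1 = e_vec @ np.diag(e_val) @ e_vec.T
    std = np.sqrt(np.diag(corr1))
    return corr1 / np.outer(std, std)          # rescale so the diagonal is 1 again

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 20))
x[:, :5] += rng.normal(size=(500, 1))          # a shared factor adds signal to 5 columns
corr = np.corrcoef(x, rowvar=False)
denoised = denoise_constant_residual(corr, n_facts=1)
```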

A denoised covariance matrix can be used to calculate the minimum variance portfolio. The efficient frontier is the upper portion of the minimum variance frontier, starting at the minimum variance portfolio. A denoised covariance matrix is more stable under small changes in the inputs.
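
For reference, the unconstrained minimum variance portfolio has the closed form w = Σ⁻¹1 / (1ᵀΣ⁻¹1). The sketch below applies it to a made-up 2x2 covariance matrix; the function name and the numbers are illustrative assumptions only.

```python
import numpy as np

def min_variance_weights(cov):
    """Closed-form minimum variance weights: inv(cov) @ 1, normalised to sum to 1."""
    ones = np.ones(cov.shape[0])
    w = np.linalg.solve(cov, ones)   # avoids forming the inverse explicitly
    return w / w.sum()

cov = np.array([[0.04, 0.006],
                [0.006, 0.09]])      # hypothetical covariance of two assets
w = min_variance_weights(cov)
portfolio_variance = float(w @ cov @ w)
print(w, portfolio_variance)
```

Because the weights depend on the inverse of the covariance matrix, small estimation errors in a noisy matrix can move them a lot, which is the motivation for denoising first.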

Note: Exercise 2.7: "Extend function fitKDE in code snippet 2.2, so that it estimates through cross-validation the optimal value of bWidth (bandwidth)".

The script ch2_fitKDE_find_bandwidth.py implements this procedure and produces the (green) KDE in figure 2.3:


Figure 2.3: Calculated bandwidth (green line) together with histogram and pdf. The green line is smoother. Bandwidth found: 0.03511191734215131
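
One way to choose a KDE bandwidth by cross-validation is to maximise the leave-one-out log-likelihood over a grid of candidates. The sketch below is a NumPy-only illustration of that idea with a Gaussian kernel; it is not the repository's `findOptimalBWidth`, and the names `loo_log_likelihood` and `find_bandwidth`, the grid and the sample are all assumptions.

```python
import numpy as np

def loo_log_likelihood(obs, bandwidth):
    """Leave-one-out log-likelihood of a Gaussian KDE with the given bandwidth."""
    obs = np.asarray(obs, dtype=float)
    n = obs.size
    diff = obs[:, None] - obs[None, :]
    kernel = np.exp(-0.5 * (diff / bandwidth) ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    np.fill_diagonal(kernel, 0.0)              # leave each point out of its own estimate
    density = kernel.sum(axis=1) / (n - 1)
    return float(np.sum(np.log(density + 1e-300)))  # small constant guards against log(0)

def find_bandwidth(obs, candidates):
    """Return the candidate bandwidth with the highest leave-one-out log-likelihood."""
    scores = [loo_log_likelihood(obs, h) for h in candidates]
    return float(candidates[int(np.argmax(scores))])

rng = np.random.default_rng(0)
sample = rng.normal(size=200)                  # stand-in for the observed eigenvalues
grid = np.logspace(-2, 1, 30)
best = find_bandwidth(sample, grid)
print(best)
```

A bandwidth that is too small overfits individual observations and scores poorly when each point is left out; one that is too large flattens the density.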

From code snippet 2.3, with a random matrix with signal: the histogram shows how the eigenvalues of a random matrix with signal are distributed. The variance of the theoretical probability density function is then fitted using fitKDE as the empirical probability density function. A good bandwidth value in fitKDE is therefore needed to find the most likely variance of the theoretical Marcenko-Pastur pdf.


Figure 2.4: histogram and pdf of eigenvalues with signal

Chapter 3 Distance Metrics

Definition of a metric:
1. Identity of indiscernibles: d(x,y) = 0 <=> x = y
2. Symmetry: d(x,y) = d(y,x)
3. Triangle inequality: d(x,z) <= d(x,y) + d(y,z)
- Properties 1, 2 and 3 together imply non-negativity: d(x,y) >= 0
Pearson correlation
Distance correlation
Angular distance
Information-theoretic codependence/entropy dependence
- Entropy: H[X] = - Σ_{x ∈ S_X} p[x] log(p[x])
- Kullback-Leibler divergence: D_KL[p||q] = - Σ_{x ∈ S_X} p[x] log(q[x]/p[x]) = Σ_{x ∈ S_X} p[x] log(p[x]/q[x])
- Cross-entropy: H_C[p||q] = - Σ_{x ∈ S_X} p[x] log(q[x]) = H[X] + D_KL[p||q]
- Mutual information: the decrease in uncertainty in X from knowing Y: I[X,Y] = H[X] - H[X|Y] = H[X] + H[Y] - H[X,Y] = E_X[D_KL[p[y|x]||p[y]]]
- Variation of information: VI[X,Y] = H[X|Y] + H[Y|X] = H[X,Y] - I[X,Y]. It is the uncertainty we expect in one variable given another variable: VI[X,Y] = 0 <=> X = Y
- The Kullback-Leibler divergence is not a metric, while variation of information is.

>>> import scipy.stats as ss
>>> ss.entropy([1./2,1./2], base=2)
1.0
>>> ss.entropy([1,0], base=2)
0.0
>>> ss.entropy([1./3,2./3], base=2)
0.9182958340544894

1 bit of information in coin toss
0 bit of information in deterministic outcome
less than 1 bit of information in unfair coin toss
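
The identities listed above can be checked numerically on small discrete distributions. The following is a minimal NumPy sketch; the helper names are illustrative, entropies are in bits, and the joint distribution is passed as a 2-D array of probabilities.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability cells contribute nothing."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log2(q[mask])))

def mutual_information(joint):
    joint = np.asarray(joint, dtype=float)
    # I[X,Y] = H[X] + H[Y] - H[X,Y], using the marginals of the joint table
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint)

def variation_of_information(joint):
    # VI[X,Y] = H[X,Y] - I[X,Y]
    return entropy(joint) - mutual_information(joint)

p, q = [0.5, 0.5], [0.25, 0.75]
identical = np.diag([0.5, 0.5])        # X and Y always take the same value
independent = np.full((2, 2), 0.25)    # X and Y are independent fair coins
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))
print(variation_of_information(identical), variation_of_information(independent))
```

Note that `kl_divergence(p, q)` and `kl_divergence(q, p)` differ in general, which is one reason the Kullback-Leibler divergence is not a metric.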

Angular distance: d_ρ[X, Y] = sqrt(1/2 * (1 - ρ[X, Y]))
Absolute angular distance: d_|ρ|[X, Y] = sqrt(1 - |ρ[X, Y]|)
Squared angular distance: d_ρ²[X, Y] = sqrt(1 - ρ[X, Y]^2)

Standard angular distance is better suited to long-only portfolio applications; squared and absolute angular distances are better suited to long-short portfolios.
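
The difference between the three correlation-based distances is easiest to see at the extremes. A small sketch, with function names chosen here for illustration:

```python
import numpy as np

def angular_distance(rho):
    return np.sqrt(0.5 * (1.0 - rho))

def absolute_angular_distance(rho):
    return np.sqrt(1.0 - np.abs(rho))

def squared_angular_distance(rho):
    return np.sqrt(1.0 - rho ** 2)

rho = np.array([-1.0, 0.0, 1.0])
print(angular_distance(rho))           # -1 is the farthest apart, +1 the closest
print(absolute_angular_distance(rho))  # -1 and +1 are both treated as close
print(squared_angular_distance(rho))   # -1 and +1 are both treated as close
```

With the standard distance a perfectly negatively correlated pair is maximally distant, while the absolute and squared variants treat it as close, which matches the long-only versus long-short distinction above.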

Chapter 4 Optimal Clustering

Use unsupervised learning to maximize intragroup similarities and minimize intergroup similarities. Consider a matrix X of shape N x F, with N objects and F features. The features are used to compute the proximity (e.g. correlation, mutual information) between the N objects, represented as an N x N matrix.

There are two broad classes of clustering algorithms, partitional and hierarchical. Common types are:

Connectivity: hierarchical clustering
Centroids: like k-means
Distribution: Gaussians
Density: search for connected dense regions, like DBSCAN and OPTICS
Subspace: clusters are modeled on two dimensions, features and observations

Random block correlation matrices are generated to simulate correlated instruments. The utility for doing this is in code snippet 4.3. It is used to test the optimal number of clusters (ONC) algorithm defined in snippets 4.1 and 4.2, which does not need a predefined number of clusters (unlike k-means) but uses an 'elbow method' to stop adding clusters. The optimal number of clusters is reached when intra-cluster correlation is high and inter-cluster correlation is low. The silhouette score is used to measure how well within-group distance is minimized and between-group distance is maximized.


Random block correlation matrix. Light colors indicate a high correlation, and dark colors indicate a low correlation. In this example, the number of blocks K=6, minBlockSize=2, and number of instruments N=30

Applying the ONC algorithm to the random block correlation matrix. ONC finds all the clusters.
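
The idea of choosing the number of clusters from silhouette scores can be sketched with scikit-learn. This covers only a simplified base step, k-means on a correlation-based distance matrix scored by mean(silhouette) / std(silhouette), and is not the full ONC algorithm from the book or the repository. The data generator, the function names and all parameter values are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

def make_block_returns(block_sizes, n_obs=500, noise=0.5, seed=0):
    """Simulate returns where each block of instruments shares one common factor."""
    rng = np.random.default_rng(seed)
    columns = []
    for size in block_sizes:
        factor = rng.normal(size=(n_obs, 1))
        columns.append(factor + noise * rng.normal(size=(n_obs, size)))
    return np.hstack(columns)

def cluster_by_silhouette(corr, max_clusters=8, seed=0):
    """Try k = 2..max_clusters and keep the labels with the best silhouette t-stat."""
    dist = np.sqrt(np.clip(0.5 * (1.0 - corr), 0.0, None))  # angular distance matrix
    best_quality, best_labels = -np.inf, None
    for k in range(2, max_clusters + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(dist)
        silh = silhouette_samples(dist, labels)
        quality = silh.mean() / silh.std()
        if quality > best_quality:
            best_quality, best_labels = quality, labels
    return best_labels

corr = np.corrcoef(make_block_returns([5, 5, 5]), rowvar=False)
labels = cluster_by_silhouette(corr)
n_clusters = len(set(labels))
print(n_clusters, labels)
```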

Chapter 5 Financial Labels

Fixed-Horizon method
Time-bar method
Volume-bar method

The Triple-Barrier Method involves holding a position until one of the following occurs:

The unrealized profit target is achieved
The unrealized loss limit is reached
The position is held beyond a maximum number of bars
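
The three barriers above can be expressed as a small labelling function. This is a simplified sketch for a single long position on a plain price array; the function name, the return-based barriers and the choice to label a time-out as 0 are assumptions, not the book's implementation.

```python
import numpy as np

def triple_barrier_label(prices, start, profit_target, stop_loss, max_bars):
    """Return 1 if the profit target is hit first, -1 if the loss limit is hit first,
    and 0 if the position reaches the maximum holding period (assumed convention)."""
    entry = prices[start]
    end = min(start + max_bars, len(prices) - 1)
    for t in range(start + 1, end + 1):
        ret = prices[t] / entry - 1.0
        if ret >= profit_target:      # upper horizontal barrier
            return 1
        if ret <= -stop_loss:         # lower horizontal barrier
            return -1
    return 0                          # vertical barrier: time ran out

up = np.array([100.0, 101.0, 103.0, 99.0, 95.0])
down = np.array([100.0, 99.0, 97.0, 104.0, 105.0])
flat = np.array([100.0, 100.5, 99.8, 100.2, 100.1])
labels = [triple_barrier_label(p, 0, 0.02, 0.02, 4) for p in (up, down, flat)]
print(labels)
```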

Trend-scanning method: the idea is to identify trends and let them run for as long and as far as they may persist, without setting any barriers.


Example of trend-scanning labels on sine wave with gaussian noise:


Trend-scanning with t-values, which show the confidence in the trend: 1 is high confidence in an upward trend and -1 is high confidence in a downward trend.

An alternative to the look-forward algorithm presented in the book is to look backward from the latest data point over the range of window sizes. For example, if the latest data point is at index 20 and the window size is between 3 and 10 days, the look-backward algorithm scans the windows from index 17 to 20 all the way back to index 11 to 20. It therefore considers only the most recent information.


trend-scanning with t-values using look-backwards
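
Both the look-forward and the look-backward variants can be sketched with an ordinary least-squares fit of price on time, keeping the window whose slope has the largest absolute t-value. The code below is a NumPy-only illustration under assumptions: the function names are hypothetical, a window of size `w` spans `w + 1` points, and `min_window` is assumed to be at least 2 so the residual variance is defined.

```python
import numpy as np

def slope_t_value(y):
    """t-value of the slope in an OLS regression of y on 0, 1, ..., n-1."""
    y = np.asarray(y, dtype=float)
    n = y.size
    x_c = np.arange(n, dtype=float) - (n - 1) / 2.0   # centred time index
    sxx = np.sum(x_c ** 2)
    slope = np.sum(x_c * (y - y.mean())) / sxx
    resid = y - y.mean() - slope * x_c
    sigma2 = np.sum(resid ** 2) / (n - 2)
    return slope / np.sqrt(sigma2 / sxx)

def trend_scan(prices, t, min_window, max_window, look_back=False):
    """Return (label, t-value) for index t, scanning forward or backward windows."""
    best_t = 0.0
    for w in range(min_window, max_window + 1):
        lo, hi = (t - w, t) if look_back else (t, t + w)
        if lo < 0 or hi >= len(prices):
            continue                                   # window falls outside the data
        t_val = slope_t_value(prices[lo:hi + 1])
        if abs(t_val) > abs(best_t):
            best_t = t_val
    return int(np.sign(best_t)), float(best_t)

rng = np.random.default_rng(0)
prices = 0.1 * np.arange(40) + rng.normal(scale=0.05, size=40)  # noisy upward trend
forward_label, forward_t = trend_scan(prices, 5, 3, 10)
backward_label, backward_t = trend_scan(prices, 30, 3, 10, look_back=True)
falling_label, _ = trend_scan(-prices, 5, 3, 10)
print(forward_label, backward_label, falling_label)
```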

Chapter 6 Feature Importance Analysis

"p-value does not measure the probability that neither the null nor the alternative hypothesis is true, or the significance of a result."


p-values computed on a set of informative, redundant, and noisy explanatory variables. The informative variables are not the ones ranked as most significant by their p-values.

"Backtesting is not a research tool. Feature importance is." (Lopez de Prado) The Mean Decrease Impurity (MDI) algorithm deals with 3 out of 4 problems with p-values:

1. MDI does not impose any tree structure or algebraic specification, and does not rely on any stochastic or distributional characteristics of the residuals (e.g. y=b₀+b₁*x_i+ε).
2. Betas are estimated from a single sample, whereas MDI relies on bootstrapping, so the variance can be reduced by increasing the number of trees in the random forest ensemble.
3. In MDI the goal is not to estimate the coefficients of a given algebraic equation (b_hat_0, b_hat_1) describing the probability of a null hypothesis.
4. MDI does not correct for being computed in-sample, as there is no cross-validation.


MDI algorithm example
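
A rough idea of how MDI is obtained from a random forest: each tree reports an impurity-based importance per feature, and these are averaged across the bootstrapped trees. The sketch below uses scikit-learn on synthetic data; the data set, the parameter values and the variable names are assumptions, and it omits refinements described in the book such as forcing `max_features=1`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 3 informative, 2 redundant and 5 noise features (assumed setup).
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=2, shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One row of impurity-based importances per tree in the ensemble.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
mdi_mean = per_tree.mean(axis=0)
mdi_std = per_tree.std(axis=0) / np.sqrt(per_tree.shape[0])  # standard error across trees
mdi_mean = mdi_mean / mdi_mean.sum()                          # normalise to sum to 1

for i, (m, s) in enumerate(zip(mdi_mean, mdi_std)):
    print(f"feature {i}: {m:.3f} +/- {s:.3f}")
```

Since the importances come from the same data the trees were fitted on, this remains an in-sample measure, which is the fourth point in the list above.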

Figure 6.4 shows that ONC correctly recognizes that there are six relevant clusters (one cluster for each informative feature, plus one cluster of noise features), and it assigns the redundant features to the cluster that contains the informative feature from which the redundant features were derived. Given the low correlation across clusters, there is no need to replace the features with their residuals.

Next, apply the clustered MDI method to the clustered data:


Figure 6.5 Clustered MDI

Clustered MDI works better than non-clustered MDI. Finally, apply the clustered MDA method to this data:


Figure 6.6 Clustered MDA

Conclusion: C_5, which is associated with the noisy features, is not important, and all other clusters have similar importance.

Chapter 7 Portfolio Construction

Convex portfolio optimization can calculate the minimum variance portfolio and the maximum Sharpe ratio portfolio.

Definition, condition number: the absolute value of the ratio between the maximum and minimum eigenvalues, |λ_max / λ_min|. The condition number indicates the instability caused by the covariance structure.
Definition, trace: trace(A) = sum(diag(A)), the sum of the diagonal elements.

Highly correlated time-series implies high condition number of the correlation matrix.

Markowitz's curse

The correlation matrix C is stable only when the correlation $\rho = 0$, that is, when there is no correlation.
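
For a 2x2 correlation matrix the effect is easy to quantify: the eigenvalues are 1 + ρ and 1 - ρ, so for ρ >= 0 the condition number is (1 + ρ) / (1 - ρ), which equals 1 at ρ = 0 and grows without bound as ρ approaches 1. A short numerical check, with illustrative names and values:

```python
import numpy as np

def condition_number(corr):
    """Absolute ratio of the largest to the smallest eigenvalue."""
    e_val = np.linalg.eigvalsh(corr)
    return float(abs(e_val.max() / e_val.min()))

results = {}
for rho in (0.0, 0.5, 0.9, 0.99):
    corr = np.array([[1.0, rho], [rho, 1.0]])
    results[rho] = condition_number(corr)
    print(rho, results[rho], (1 + rho) / (1 - rho))
```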

Hierarchical risk parity (HRP) outperforms Markowitz in out-of-sample Monte Carlo experiments, but is sub-optimal in-sample.

Code-snippet 7.1 illustrates the signal-induced instability of the correlation matrix.

>>> import numpy as np
>>> # mc is the repository module that defines formBlockMatrix
>>> corr0 = mc.formBlockMatrix(2, 2, .5)
>>> corr0
array([[1. , 0.5, 0. , 0. ],
       [0.5, 1. , 0. , 0. ],
       [0. , 0. , 1. , 0.5],
       [0. , 0. , 0.5, 1. ]])
>>> eVal, eVec = np.linalg.eigh(corr0)
>>> print(max(eVal)/min(eVal))
3.0


Figure 7.1 Heatmap of a block-diagonal correlation matrix

Code-snippet 7.2 creates the same block-diagonal matrix but with only one dominant block. The condition number is nevertheless the same.

>>> from scipy.linalg import block_diag
>>> corr0 = block_diag(mc.formBlockMatrix(1,2, .5))
>>> corr1 = mc.formBlockMatrix(1,2, .0)
>>> corr0 = block_diag(corr0, corr1)
>>> corr0
array([[1. , 0.5, 0. , 0. ],
       [0.5, 1. , 0. , 0. ],
       [0. , 0. , 1. , 0. ],
       [0. , 0. , 0. , 1. ]])
>>> eVal, eVec = np.linalg.eigh(corr0)
>>> matrix_condition_number = max(eVal)/min(eVal)
>>> print(matrix_condition_number)
3.0

This demonstrates that bringing down the intrablock correlation in only one of the two blocks does not reduce the condition number. It shows that the instability in Markowitz's solution can be traced back to the dominant blocks.
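
The two experiments can also be reproduced without the repository module. The sketch below builds the block matrices with NumPy only; `form_block_matrix` is a hypothetical stand-in for `formBlockMatrix`, and `np.linalg.cond` is used because for a symmetric positive-definite matrix the 2-norm condition number equals the eigenvalue ratio.

```python
import numpy as np

def form_block_matrix(n_blocks, block_size, block_corr):
    """Block-diagonal correlation matrix with constant intra-block correlation."""
    block = np.full((block_size, block_size), block_corr, dtype=float)
    np.fill_diagonal(block, 1.0)
    return np.kron(np.eye(n_blocks), block)   # repeat the block along the diagonal

both_blocks = form_block_matrix(2, 2, 0.5)
one_dominant = both_blocks.copy()
one_dominant[2:, 2:] = np.eye(2)              # remove the correlation in the second block

cond_both = float(np.linalg.cond(both_blocks))
cond_dominant = float(np.linalg.cond(one_dominant))
cond_stronger = float(np.linalg.cond(form_block_matrix(2, 2, 0.9)))
print(cond_both, cond_dominant, cond_stronger)
```

Raising the intra-block correlation from 0.5 to 0.9 increases the condition number, while weakening only one of the blocks leaves it unchanged.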


Figure 7.2 Heatmap of a dominant block-diagonal correlation matrix

The nested Clustered Optimization Algorithm (NCO)

NCO provides a strategy for addressing the effect of Markowitz's curse on an existing mean-variance allocation method.

1. Cluster the correlation matrix.
2. Compute optimal intracluster allocations, using the denoised covariance matrix.
3. Compute optimal intercluster allocations, using the reduced covariance matrix, which is close to a diagonal matrix, so the optimization problem is close to the ideal Markowitz case when $\rho = 0$.
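
The three steps can be sketched for the minimum variance case. This is an illustration under assumptions rather than the book's implementation: the clusters are supplied by hand instead of coming from a clustering algorithm, the covariance matrix is not denoised, and the function names are hypothetical.

```python
import numpy as np

def min_var_weights(cov):
    """Closed-form minimum variance weights, normalised to sum to 1."""
    w = np.linalg.solve(cov, np.ones(cov.shape[0]))
    return w / w.sum()

def nco_weights(cov, clusters):
    """Nested allocation: within each cluster first, then across clusters."""
    n = cov.shape[0]
    intra = np.zeros((n, len(clusters)))
    for k, members in enumerate(clusters):
        sub_cov = cov[np.ix_(members, members)]
        intra[members, k] = min_var_weights(sub_cov)   # step 2: intracluster weights
    reduced_cov = intra.T @ cov @ intra                # covariance between cluster portfolios
    inter = min_var_weights(reduced_cov)               # step 3: intercluster weights
    return intra @ inter                               # final weight per instrument

corr = np.array([[1.0, 0.8, 0.0, 0.0],
                 [0.8, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.5],
                 [0.0, 0.0, 0.5, 1.0]])
cov = 0.04 * corr                                      # every instrument has 20% volatility
weights = nco_weights(cov, clusters=[[0, 1], [2, 3]])  # step 1 assumed already done
print(weights)
```

In this example the less correlated cluster receives the larger share, because its equal-weighted cluster portfolio has the lower variance.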

Chapter 8 Testing set overfitting

Backtesting is a historical simulation of how an investment strategy would have performed in the past. Backtesting suffers from selection bias under multiple testing, as researchers run millions of tests on historical data and present only the best ones (overfitted). This chapter studies how to measure the effect of selection bias.

Precision and recall

Precision and recall under multiple testing

The Sharpe ratio

Sharpe Ratio = μ/σ

The 'False Strategy' theorem

A researcher may run many historical simulations and report only the best one (maximum Sharpe ratio). The expected maximum Sharpe ratio is not the same as the expected Sharpe ratio of a single trial. Hence selection bias under multiple testing (SBuMT).

Experimental results

A Monte Carlo experiment shows that the expected maximum Sharpe ratio increases with the number of trials (E[max(Sharpe ratio)] = 3.26) even when the expected Sharpe ratio is 0 (E[Sharpe ratio] = 0). So an investment strategy will seem promising even when there is no good strategy.

When more than one trial takes place, the expected value of the maximum Sharpe ratio is greater than the expected value of the Sharpe ratio from a random trial (when the true Sharpe ratio = 0 and the variance > 0).
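
The effect can be reproduced in a few lines. The sketch below compares a simulated expected maximum with the approximation from the False Strategy theorem, E[max SR] ≈ sqrt(V) * ((1 - γ) Z⁻¹[1 - 1/K] + γ Z⁻¹[1 - 1/(K e)]), where γ is the Euler-Mascheroni constant, K the number of trials and V the variance of the Sharpe ratios across trials. The choice of K = 1000 and V = 1 is an assumption made here, as are the function and variable names; with these values the formula gives roughly 3.26.

```python
import numpy as np
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329

def expected_max_sharpe(n_trials, variance=1.0):
    """Approximate E[max Sharpe ratio] over n_trials (> 1) when the true Sharpe ratio is 0."""
    z_inv = NormalDist().inv_cdf
    value = ((1 - EULER_GAMMA) * z_inv(1 - 1.0 / n_trials)
             + EULER_GAMMA * z_inv(1 - 1.0 / (n_trials * np.e)))
    return float(np.sqrt(variance) * value)

rng = np.random.default_rng(0)
n_trials = 1000
# Each row is one experiment: n_trials Sharpe ratios drawn with mean 0 and variance 1.
simulated = float(rng.normal(size=(1000, n_trials)).max(axis=1).mean())
theoretical = expected_max_sharpe(n_trials)
print(simulated, theoretical)
```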


Figure 8.1 Comparison of experimental and theoretical results from False Strategy Theorem

The Deflated Sharpe Ratio

The main conclusion from the False Strategy Theorem is that, unless $\max_k\{\widehat{SR}_k\} \gg E[\max_k\{\widehat{SR}_k\}]$, the discovered strategy is likely to be a false positive.

Type II errors under multiple testing

The interaction between type I and type II errors

Appendix A: Testing on Synthetic data

Synthetic data can be generated either by resampling or by Monte Carlo.
