Polygraph

RLHF Mechanistic Interpretability and Deception

Stars: 6

Forks: 2

Language: None

Last Updated: Aug 26, 2023

Similar Repos

Repo	Language	Stars	Description	Updated At
CircuitsVis	Jupyter Notebook	37	Mechanistic Interpretability Visualizations using React	May 08, 2023
CircuitsVis	Jupyter Notebook	2	Mechanistic Interpretability Visualizations using React	Dec 11, 2023
Grokking	Python	5	A Mechanistic Interpretability Analysis of Grokking	Jun 13, 2023
mechanisticinterpretability	None	2	A repository for awesome resources in mechanistic interpretability	Mar 08, 2023
transformer-visualization	Jupyter Notebook	2	Mechanistic Interpretability Tutorials, Results and research log as I learn from @neelnanda-io's wonderful Easy-Transformer	Mar 23, 2023
Emergent-World-Representations-Othello	Jupyter Notebook	3	A mechanistic interpretability study investigating a sequential model trained to play the board game Othello	Oct 25, 2023
SpyWear	Python	3	A game of deception and explosions.	Jul 22, 2016
rlhf_langchain	Jupyter Notebook	2	Langchain for RLHF	Mar 09, 2023
Thyrosim.jl	Julia	3	Thyroid hormone mechanistic model	Dec 29, 2022
HoneyProcs	C	3	Deception Technology for Endpoints	Oct 12, 2021
torwolf	JavaScript	20	A game of communication, deception, and media	May 23, 2017
OpenStoryTeller	Jupyter Notebook	2	Story Teller based on RLHF and GPT	Apr 27, 2023
stable-rlhf	Python	2	RLHF Pipeline for StableLM	Apr 26, 2023
code-adversary	Jupyter Notebook	4	Deception and bias-detection code for code LLMs	Aug 23, 2022
Python-Honeypot	Python	285	OWASP Honeypot, Automated Deception Framework.	Aug 08, 2022
Dejavu	JavaScript	337	DejaVU - Open Source Deception Framework	Jul 17, 2022
darkdeception	None	2	Dark Deception - Turkish Patch (Unofficial)	May 02, 2021
captum	Python	3353	Model interpretability and understanding for PyTorch	Aug 10, 2022
trelawney	None	2	General Interpretability Package	Feb 27, 2020
AmI	Jupyter Notebook	8	Attacks Meet Interpretability	May 17, 2022
filecoin-mecha-twin	Jupyter Notebook	11	Mechanistic model for the Filecoin Economy	Mar 18, 2023
swindler	JavaScript	3	A social deception game based on culture and art	Oct 11, 2020
Echelons	None	4	Echelons of Deception and Survival — Competitive Online Multiplayer Game	Mar 19, 2023
awesome-rlhf	None	2	Lists of datasets, training, and evals for RLHF and similar	Jul 05, 2023
churn-prediction-with-text-and-interpretability	Jupyter Notebook	7	Predict customer churn with text and interpretability.	Oct 24, 2021
TorchEsegeta	Python	10	TorchEsegeta: Interpretability and Explainability pipeline for PyTorch	May 31, 2022
EpiForecastStatMech	Python	6	Exploring methods for merging mechanistic and models to forecast epidemics.	May 19, 2022
interpretability-methods	Python	3	gradients based interpretability methods	Jul 06, 2021
talk-textual-interpretability	JavaScript	2	Talk on textual interpretability	Jul 20, 2020
othelloscope	Jupyter Notebook	2	Interpretability Hackathon 2.0 entry	Apr 16, 2023
mli-resources	Jupyter Notebook	2	Machine Learning Interpretability Resources	Aug 15, 2019
nemesis	Python	8	Reward Model framework for LLM RLHF	May 09, 2023
TowerOfDeception	HTML	3	Tower of Deception for BG2:ToB, BG2:EE and EET	Oct 18, 2022
counter-reconnaissance-program	Python	3	Proof-of-concept cyber deception utility emulating Samba and LibSSH	Feb 16, 2022
AIX360	Python	1152	Interpretability and explainability of data and machine learning models	Sep 03, 2022
rapido	C++	5	Repeatable Analysis Programming for Interpretability, Durability, and Organization	Mar 01, 2023
SEI-SEIR_Arboviruses	R	2	Climate-based mechanistic model of arbovirus transmission in Ecuador and Kenya.	Dec 01, 2023
mechanismEncoder	HTML	4	Developing patient-specific phosphoproteomic models using mechanistic autoencoders	Aug 11, 2022
sussy	C#	8	2-4 players fast-paced party game of strategy and deception	Apr 18, 2022
LLaMA-Efficient-Tuning	Python	4	Fine-tuning LLaMA with PEFT (SFT+RLHF)	May 28, 2023
interp	TypeScript	8	Redwood Research's transformer interpretability tools	Apr 28, 2022
mli-resources	Jupyter Notebook	452	H2O.ai Machine Learning Interpretability Resources	Aug 14, 2022
mli-resources	Jupyter Notebook	2	H2O.ai Machine Learning Interpretability Resources	Feb 08, 2022
SurvLIMEpy	Python	7	Local interpretability for survival models	Apr 26, 2023
agent	Python	16	Interpretability dashboard for reinforcement learners	Nov 25, 2021
instructGOOSE	Jupyter Notebook	87	Implementation of Reinforcement Learning from Human Feedback (RLHF)	Mar 29, 2023
LLaMA-MOSS-RLHF-LoRA	Python	5	Training LLaMA or MOSS with RLHF, optionally with LoRA	Jun 12, 2023
interaction_interpretability	None	3	Feature Interaction Interpretability via Interaction Detection	Oct 25, 2020
plain-english-legalese	Python	2	Increase the accessibility and interpretability of law to the layperson	Oct 11, 2021
DeceptionSemioticSquares	HTML	2	Using Scattertext to examine publicly available datasets about deception	Dec 14, 2018