Seed Papers
These are small, curated lists of papers to serve as seeds for the seminar. Papers are loosely categorized for your convenience. You can also come up with categories that are not listed here and suggest papers for them, as long as you believe they fit the AI4CA domain, such as
- “alternative machine learning models for code-related tasks”, e.g., S4,
- “code generation as planners” (deliberately left vague), e.g., ViperGPT 🐍,
- “neuro-symbolic methodologies for code-related tasks”.
If you propose such a topic of your own, the advisors will consider whether to accept it.
Use this list to understand the domain we are working in and to find related, more up-to-date papers by
- looking up authors (or conferences) of interesting papers to find their new papers on Google Scholar, Semantic Scholar, DBLP, etc.,
- online tools such as Connected Papers, Papers With Code, etc.,
- and any other way you can imagine … (a scripted lookup sketch follows below).
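If you prefer to script such lookups, here is a minimal sketch using the public Semantic Scholar Graph API (`/graph/v1/paper/search`). The endpoint and field names follow the documented API; the query string is only an illustrative placeholder, so substitute your seed paper's topic.

```python
# Minimal sketch: keyword search against the Semantic Scholar Graph API.
# Assumes network access; an API key is optional at low request volumes.
import requests

QUERY = "repository-level code completion"  # placeholder topic; replace with your own
URL = "https://api.semanticscholar.org/graph/v1/paper/search"

resp = requests.get(
    URL,
    params={"query": QUERY, "fields": "title,year,venue,authors", "limit": 10},
    timeout=30,
)
resp.raise_for_status()

# Each result carries only the fields requested above.
for paper in resp.json().get("data", []):
    first_authors = ", ".join(a["name"] for a in (paper.get("authors") or [])[:3])
    print(f"{paper.get('year')} | {paper.get('title')} ({paper.get('venue')}) - {first_authors}")
```

From there, you can chase citations of promising hits (the same API exposes a `/graph/v1/paper/{paper_id}/citations` endpoint) or paste titles into Connected Papers.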
Abbreviations
- Code Classification: cc
- Code Clone Detection: ccd
- Code Completion: cco
- Code Documentation: cdo
- Code Summarization: cs
- Code Search: cse
- Code Retrieval: cr
- Code Translation: ct
- Program Similarity: ps
- Program Repair: pr
- Vulnerability Identification: vi
- Variable Misuse Prediction: vmp
- Execution-based evaluation: ee
- Constrained Decoding: cd
Papers are loosely labeled with:
- their major subject: the section name (e.g. Analysis);
- within each section, items may carry two optional prefixes:
  - {…} extra related subjects (see Abbreviations above, e.g. {cc} = Code Classification),
  - (…) a nickname for the model/dataset presented, e.g. (CodeBLEU).
For example, “{ee} (EvalPerf) Evaluating Language Models for Efficient Code Generation” under Datasets is an execution-based-evaluation paper introducing the EvalPerf benchmark.
Analysis
- {ps} CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code
- Bugs in Large Language Models Generated Code: An Empirical Study
- What’s Wrong with Your Code Generated by Large Language Models? An Extensive Study
- {vi} A Survey of Trojans in Neural Models of Source Code: Taxonomy and Techniques
- {vi} Prompt Injection Attacks and Defenses in LLM-Integrated Applications
Datasets
- (GraphGen4Code) A Toolkit for Generating Code Knowledge Graphs
- (ChatDANCE) You Augment Me: Exploring ChatGPT-based Data Augmentation for Semantic Code Search
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion
- {ee} (EvoEval) Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM
- {ee} EffiBench: Benchmarking the Efficiency of Automatically Generated Code
- {ee} (EvalPerf) Evaluating Language Models for Efficient Code Generation
Pre-training
- (GraphCodeBERT) GraphCodeBERT: Pre-training Code Representations with Data Flow
- (SynCoBERT) SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation
Transformer-based
- {cs} Language-Agnostic Representation Learning of Source Code from Structure and Context
- {cs,cco} Long-Range Modeling of Source Code Files with eWASH: Extended Window Access by Syntax Hierarchy
- Gorilla: Large Language Model Connected with Massive APIs
- HiRoPE: Length Extrapolation for Code Models Using Hierarchical Position
Graph-based
- Large Language Models on Graphs: A Comprehensive Survey
- Multi-View Graph Representation for Programming Language Processing: An Investigation into Algorithm Detection
- GALLa: Graph Aligned Large Language Models for Improved Source Code Understanding
- CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation
- GraphGPT: Graph Instruction Tuning for Large Language Models
- GraphLLM: Boosting Graph Reasoning Ability of Large Language Model
- Rethinking Positional Encoding in Tree Transformer for Code Representation
- (GoT) Boosting Logical Reasoning in Large Language Models through a New Framework: The Graph of Thought
State Space Models and Transformers for Code Understanding
- (BiGS) Pretraining Without Attention
- LOCOST: State-Space Models for Long Document Abstractive Summarization
- (ModernBERT-Large-Instruct) It’s All in The [MASK]: Simple Instruction-Tuning Enables BERT-like Masked Language Models As Generative Classifiers
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
- (SPADE) Efficient Long Sequence Modeling via State Space Augmented Transformer
- SMR: State Memory Replay for Long Sequence Modeling
- Semantic Word and Sentence Embeddings Compression using Discrete Wavelet Transform
- (LongMem) Augmenting Language Models with Long-Term Memory
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Representation Learning (Embeddings)
- {vi,ccd} (DISCO) Towards Learning (Dis)-Similarity of Source Code from Program Contrasts
- Neural Code Comprehension: A Learnable Representation of Code Semantics
- TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation
- Flow2Vec: Value-Flow-Based Precise Code Embedding
- (SCodeR) Soft-Labeled Contrastive Pre-training for Function-level Code Representation
- Code Representation Learning At Scale
- (Corder) Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations
Code Generation
- (PPOCoder) Execution-based Code Generation using Deep Reinforcement Learning
- RLTF: Reinforcement Learning from Unit Test Feedback
- {cd} SynCode: LLM Generation with Grammar Augmentation
- {cd} ROCODE: Integrating Backtracking Mechanism and Program Analysis in Large Language Models for Code Generation
- {vi} AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing
- Security Attacks on LLM-based Code Completion Tools
- {cco} RepoHyper: Search-Expand-Refine on Semantic Graphs for Repository-Level Code Completion