Seed Papers
Below is a list of papers that can be used as seeds for the seminar. They are loosely categorized for your convenience. You can also suggest papers not listed here if you believe they fit the domain of AI4CA.
Use your seed paper to understand your topic and to find related, more recent papers. The following resources can help:
- Search engines such as Google Scholar, Semantic Scholar, or DBLP.
- Tools such as Connected Papers or Papers With Code.
- Homepages of relevant conferences or researchers (e.g., looking up the authors and conference of your seed paper).
Evaluation
- CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code
- What’s Wrong with Your Code Generated by Large Language Models? An Extensive Study
- CROSSCODEEVAL: A Diverse and Multilingual Benchmark for Cross-File Code Completion
- Top Leaderboard Ranking = Top Coding Proficiency, Always? EVOEVAL: Evolving Coding Benchmarks via LLM
- EffiBench: Benchmarking the Efficiency of Automatically Generated Code
- (EvalPerf) Evaluating Language Models for Efficient Code Generation
- Re-Evaluating Code LLM Benchmarks Under Semantic Mutation
Transformer Architectures
- Language-Agnostic Representation Learning of Source Code from Structure and Context
- Long-Range Modeling of Source Code Files with eWASH: Extended Window Access by Syntax Hierarchy
- HiRoPE: Length Extrapolation for Code Models Using Hierarchical Position
Graph Architectures
- Large Language Models on Graphs: A Comprehensive Survey
- Multi-View Graph Representation for Programming Language Processing: An Investigation into Algorithm Detection
- GALLa: Graph Aligned Large Language Models for Improved Source Code Understanding
- CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation
- GraphGPT: Graph Instruction Tuning for Large Language Models
- GraphLLM: Boosting Graph Reasoning Ability of Large Language Model
- Rethinking Positional Encoding in Tree Transformer for Code Representation
- (GoT) Boosting Logical Reasoning in Large Language Models through a New Framework: The Graph of Thought
- (GraphGen4Code) A Toolkit for Generating Code Knowledge Graphs
State Space Architectures
- (BiGS) Pretraining Without Attention
- LOCOST: State-Space Models for Long Document Abstractive Summarization
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
- (SPADE) Efficient Long Sequence Modeling via State Space Augmented Transformer
- SMR: State Memory Replay for Long Sequence Modeling
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Fine-tuning Methods
Representation Learning (Embeddings)
- (DISCO) Towards Learning (Dis)-Similarity of Source Code from Program Contrasts
- Neural Code Comprehension: A Learnable Representation of Code Semantics
- TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation
- Flow2Vec: Value-Flow-Based Precise Code Embedding
- (SCodeR) Soft-Labeled Contrastive Pre-training for Function-level Code Representation
- Code Representation Learning At Scale
- (Corder) Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations
- GraphCodeBERT: Pre-training Code Representations with Data Flow
- SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation
Code Understanding
- (ModernBERT-Large-Instruct) It’s All in The [MASK]: Simple Instruction-Tuning Enables BERT-like Masked Language Models As Generative Classifiers
- Semantic Word and Sentence Embeddings Compression using Discrete Wavelet Transform
- (LongMem) Augmenting Language Models with Long-Term Memory
- (ChatDANCE) You Augment Me: Exploring ChatGPT-based Data Augmentation for Semantic Code Search
Code Generation
- RepoHyper: Search-Expand-Refine on Semantic Graphs for Repository-Level Code Completion
- Type-Constrained Code Generation with Language Models
- Correctness-Guaranteed Code Generation via Constrained Decoding
Secure Coding Assistance
- A Survey of Trojans in Neural Models of Source Code: Taxonomy and Techniques
- Prompt Injection Attacks and Defenses in LLM-Integrated Applications
- Security Attacks on LLM-based Code Completion Tools
- When Developer Aid Becomes Security Debt: A Systematic Analysis of Insecure Behaviors in LLM Coding Agents
- Guiding AI to Fix Its Own Flaws: An Empirical Study on LLM-Driven Secure Code Generation
Coding Agents
- ChatDev: Communicative Agents for Software Development
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering