Korea University · Document AI Group

KUDoc

KUDoc is the Document AI research group at Korea University, turning unstructured documents into structured evidence for reliable reasoning and real-world AI.

5 Core Researchers · 3+ Research Areas · 12 Papers at Top Conferences · 4 Projects
Milestones
⚙️ 2024.03 – Present
Synapse
📄 2024.11
EMNLP 2024
1 paper accepted
  • StyleDFS — Predictive Maintenance RAG for Power Plants
⚙️ 2025.01 – Present
DocGraph Copilot
📄 2025.05
ACL 2025
2 papers accepted
  • REVISE — OCR error correction via data contamination strategy
  • Enhancing Automatic Term Extraction with LLMs via Syntactic Retrieval
📄 2025.11
EMNLP 2025
3 papers accepted
  • MultiDocFusion — Hierarchical multimodal chunking pipeline
  • Benchmark Profiling — Mechanistic diagnosis of LLM benchmarks (Oral)
  • KoLEG — Korean legal knowledge editing with continuous retrieval
📄 2026.06
CVPR 2026
1 paper accepted
  • M3DocDep — Multi-document dependency chunking with LVLMs
📄 2026.08
ACL 2026
5 papers accepted
  • HiKEY — Hierarchical multimodal retrieval for document QA
  • EASE — Entity-aware sub-table generation for multi-table QA
  • LangSAE Editing — Multilingual IR via post-hoc language identity removal
  • Exploring Coding Spot — Parametric contributions to LLM coding performance
  • MMAC — Multilingual multimodal alignment for cultural grounding evaluation
What we work on

Research Focus

📄 Document Understanding

Layout-aware parsing, table / figure extraction, and structural analysis of complex documents including PDFs, scanned images, and rich-format files.

🔍 Multimodal RAG

Retrieval-augmented generation pipelines that fuse text, layout, and visual signals for accurate, grounded answers from large document corpora.

🧠 LLM-based Reasoning

Prompting strategies, fine-tuning, and agent frameworks that enable large language models to reason faithfully over long, structured documents.

📊 Information Extraction

Named entity recognition, relation extraction, and key-value pair detection from semi-structured and unstructured business documents.

📏 Benchmark & Evaluation

Constructing rigorous evaluation suites and leaderboards for Document AI tasks, with a focus on Korean-language and multilingual settings.

🌐 Multilingual Document AI

Extending Document AI capabilities to Korean and other low-resource languages, addressing the unique challenges of mixed-script documents.

People

Group

Joongmin Shin · Group Lead · Founder
Multimodal AI Researcher · Human-Inspired AI Research

Gyuho Shim (심규호) · Co-Lead · Co-Founder
M.S. · Korea University

Myunghoon Kang (강명훈) · Member
Ph.D. Student · Korea University

Dongjun Kim · Member
AI Research Engineer · Upstage

Duong Tuan Thanh · Member
M.S. · Korea University
Selected Work

Publications

Document AI & Multimodal
International
ACL 2026 Main (Oral)
HiKEY: Hierarchical Multimodal Retrieval for Open-Domain Document Question Answering
Joongmin Shin, Gyuho Shim, Jeongbae Park, Jaehyung Seo, Heuiseok Lim
HiKEY introduces a hierarchical multimodal retrieval framework for open-domain document question answering, combining global routing, local ranking, and structured evidence assembly across long and visually complex documents.
ACL 2026 Main
EASE: Entity-Aware Sub-table Generation for Real-world Multi-table QA
Myunghoon Kang, Dahyun Jung, Suhyune Son, Seonmin Koo, Changwoo Chun, Yuna Hur
Proposes an entity-aware sub-table generation framework for question answering over real-world multi-table settings, addressing challenges of entity disambiguation and cross-table reasoning in structured document understanding.
CVPR 2026 Main
M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models
Joongmin Shin, Jeongbae Park, Jaehyung Seo, Heuiseok Lim
Introduces a dependency-aware chunking framework that leverages LVLMs to capture structural and semantic dependencies across multiple pages and documents for improved retrieval.
EMNLP 2025 Main
MultiDocFusion: Hierarchical and Multimodal Chunking Pipeline for Enhanced RAG on Long Industrial Documents
Joongmin Shin, Chanjun Park, Jeongbae Park, Jaehyung Seo, Heuiseok Lim
Proposes a hierarchical multimodal chunking pipeline that integrates text, layout, and visual signals to boost RAG performance on long, complex industrial documents.
ACL 2025 Industry (Oral)
REVISE: A Framework for Revising OCRed Text in Practical Information Systems with Data Contamination Strategy
Gyuho Shim, Seongtae Hong, Heuiseok Lim
Introduces a hierarchical taxonomy of OCR errors and a synthetic data contamination strategy that injects realistic OCR-like noise at the character, word, and structural level — enabling robust document reconstruction without costly real-error annotations.
EMNLP 2024 Industry
Intelligent Predictive Maintenance RAG Framework for Power Plants: Enhancing QA with StyleDFS and Domain-Specific Instruction Tuning
Seongtae Hong, Joongmin Shin, Jaehyung Seo, Taemin Lee, Jeongbae Park, Heuiseok Lim
Presents a domain-specific RAG system for power plant maintenance QA, combining StyleDFS retrieval and instruction-tuned generation for industrial deployment.
Domestic
HCLT 2024
초거대 언어모델을 활용한 교육 도메인의 구조적 정보 추출 (Structural Information Extraction in the Education Domain Using Large Language Models)
Myunghoon Kang
Proposes the task of extracting structured information in the education domain using large language models. An evaluation dataset is constructed for hierarchically extracting key terms from textbook paragraph text, and both open-source and proprietary LLMs are evaluated in an in-context learning setting.
Others
International
ACL 2026 Main (Oral)
LangSAE Editing: Improving Multilingual Information Retrieval via Post-hoc Language Identity Removal
Proposes a post-hoc language identity removal approach using sparse autoencoder editing to improve cross-lingual retrieval performance in multilingual language models.
ACL 2026 Findings
Exploring Coding Spot: Understanding Parametric Contributions to LLM Coding Performance
Dongjun Kim, Minhyuk Kim, Yong Chan Chun, Chanjun Park, Heuiseok Lim
Investigates which parameters within LLMs are responsible for coding capabilities through mechanistic analysis, offering insights for targeted fine-tuning and model understanding.
ACL 2026 Main
MMAC: A Multilingual, Multimodal Alignment Framework for Cultural Grounding Evaluation
Introduces a multilingual and multimodal alignment benchmark for evaluating cultural grounding in vision-language models across diverse linguistic and cultural contexts.
EMNLP 2025 Main (Oral)
Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks
Dongjun Kim, Gyuho Shim, Yong Chan Chun, Minhyuk Kim, Chanjun Park, Heuiseok Lim
Proposes a mechanistic diagnostic framework for LLM benchmark evaluation, identifying systematic failure modes and biases in standard benchmarks through profiling techniques.
EMNLP 2025 Findings
KoLEG: On-the-Fly Korean Legal Knowledge Editing with Continuous Retrieval
Jaehyung Seo, Dahyun Jung, Jaewook Lee, Yong Chan Chun, Dongjun Kim, Hwijung Ryu, Donghoon Shin, Heuiseok Lim
Introduces a continuous retrieval framework for on-the-fly Korean legal knowledge editing, enabling dynamic updates to LLM knowledge bases in domain-specific legal contexts.
ACL 2025 Findings
Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval
Yong Chan Chun, Minhyuk Kim, Dongjun Kim, Chanjun Park, Heuiseok Lim
Leverages syntactic retrieval with LLMs to improve automatic term extraction, achieving stronger precision and recall on domain-specific terminology identification tasks.
Systems & Applied AI

Projects

Document AI & Multimodal
2025.01 — Present · Korea University

DocGraph Copilot

Multimodal Chunking for Real-World Industrial Documents

Developed a novel multimodal chunking methodology for real-world industrial data. The method converts document layout information into a hierarchical structure and chunks along that structure, enabling more accurate and grounded retrieval-augmented generation over complex, long-form industrial documents.
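The core idea, turning flat layout blocks into a hierarchy and chunking along it, can be illustrated with a minimal sketch. This is not KUDoc's actual implementation; every class and function name below is hypothetical, and the input format (heading level plus text) is a deliberately simplified stand-in for real layout analysis:

```python
# Illustrative sketch only: hierarchical, layout-aware document chunking.
# All names here are hypothetical, not part of any published KUDoc code.
from dataclasses import dataclass, field

@dataclass
class Section:
    """A node in the document's layout hierarchy."""
    title: str
    level: int                                       # 1 = top-level heading
    text: list[str] = field(default_factory=list)    # body text in this section
    children: list["Section"] = field(default_factory=list)

def build_hierarchy(blocks: list[tuple[int, str]]) -> Section:
    """Fold a flat list of (heading_level, text) layout blocks into a tree.
    Level 0 marks body text attached to the current section."""
    root = Section("root", 0)
    stack = [root]
    for level, text in blocks:
        if level == 0:                               # body text
            stack[-1].text.append(text)
        else:                                        # heading: pop up to its parent
            while stack[-1].level >= level:
                stack.pop()
            node = Section(text, level)
            stack[-1].children.append(node)
            stack.append(node)
    return root

def chunk(node: Section, path: tuple[str, ...] = ()) -> list[str]:
    """Emit one chunk per section, prefixed with its heading path so the
    retriever sees structural context, not just the local text."""
    chunks = []
    here = path + ((node.title,) if node.title != "root" else ())
    if node.text:
        chunks.append(" > ".join(here) + ": " + " ".join(node.text))
    for child in node.children:
        chunks.extend(chunk(child, here))
    return chunks

blocks = [(1, "Maintenance Manual"), (0, "General safety rules."),
          (2, "Turbine"), (0, "Inspect blades monthly.")]
print(chunk(build_hierarchy(blocks)))
```

Prefixing each chunk with its heading path is one simple way to keep hierarchical context visible to the retriever; the actual system presumably fuses far richer signals (tables, figures, visual layout) across multiple modalities.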

2024.03 — Present · Government-funded

Synapse

Development of an LLM-Based Multilingual Consultation Service

Synapse is a multilingual consultation system powered by large language models, designed to deliver accurate, context-aware guidance across language barriers. The project focuses on robust multilingual understanding, domain-specific instruction tuning, and safe deployment of LLMs in real-world advisory contexts.

Based on: StyleDFS · REVISE
Others
2023.09 — 2024.08 · Government-funded

AI-based Thought Structuring Learning Tutor

Development of AI-based Thought Structuring Learning Tutoring Technology for Improving Literacy Competency

Developed a structured summarization evaluation dataset and framework based on textbook document parsing, designed to score and diagnose learners' structured learning outputs. This work establishes a core foundation for AI-driven tutoring systems in the educational domain.

NC AI Consortium · WBL · ETRI

VAETKI

Large Language Model — NC AI Consortium (13 Organizations)

VAETKI is a large language model developed by the NC AI consortium, a collaborative initiative led by NC AI with participation from 13 organizations including WBL and ETRI. Designed with scalability and efficiency as primary goals, VAETKI adopts a Mixture-of-Experts (MoE) architecture to balance performance and computational cost. Key features include tool-agent tasks in non-thinking mode, human-preference alignment for accurate instruction following, and multilingual support across English, Korean, Chinese, and Japanese.

Network

Collaborating Institutions

KUDoc is anchored at Korea University and works in close research collaboration with institutions across academia and industry in Korea, Canada, and Singapore.

Academic Institutions
Korea University (Anchor Institution) · South Korea
Konkuk University · South Korea
Soongsil University · South Korea
York University · Canada
SUTD (Singapore University of Technology and Design) · Singapore

Industry & Research Collaborators
Microsoft Asia · Corporate Research & Technology
Upstage · AI Company · South Korea
Get in touch

Contact

Get Involved with KUDoc
We are always looking for collaborators! If you're interested in joining our efforts, drop us a line at: