Korea University · Document AI Group

KUDoc

KUDoc is the Document AI research group at Korea University, turning unstructured documents into structured evidence for reliable reasoning and real-world AI.

5 Core Researchers · 3+ Research Areas · 12 Papers at Top Conferences · 4 Projects
Milestones
⚙️ 2024.03 – Present
Synapse
📄 2024.11
EMNLP 2024
1 paper accepted
  • StyleDFS — Predictive Maintenance RAG for Power Plants
⚙️ 2025.01 – Present
DocGraph Copilot
📄 2025.05
ACL 2025
2 papers accepted
  • REVISE — OCR error correction via data contamination strategy
  • Enhancing Automatic Term Extraction with LLMs via Syntactic Retrieval
📄 2025.11
EMNLP 2025
3 papers accepted
  • MultiDocFusion — Hierarchical multimodal chunking pipeline
  • Benchmark Profiling — Mechanistic diagnosis of LLM benchmarks (Oral)
  • KoLEG — Korean legal knowledge editing with continuous retrieval
📄 2026.06
CVPR 2026
1 paper accepted
  • M3DocDep — Multi-document dependency chunking with LVLMs
📄 2026.08
ACL 2026
5 papers accepted
  • HiKEY — Hierarchical multimodal retrieval for document QA
  • EASE — Entity-aware sub-table generation for multi-table QA
  • LangSAE Editing — Multilingual IR via post-hoc language identity removal
  • Exploring Coding Spot — Parametric contributions to LLM coding performance
  • MMAC — Multilingual multimodal alignment for cultural grounding evaluation
What we work on

Research Focus

📄 Document Understanding

Layout-aware parsing, table / figure extraction, and structural analysis of complex documents including PDFs, scanned images, and rich-format files.

🔍 Multimodal RAG

Retrieval-augmented generation pipelines that fuse text, layout, and visual signals for accurate, grounded answers from large document corpora.

🧠 LLM-based Reasoning

Prompting strategies, fine-tuning, and agent frameworks that enable large language models to reason faithfully over long, structured documents.

📊 Information Extraction

Named entity recognition, relation extraction, and key-value pair detection from semi-structured and unstructured business documents.

📏 Benchmark & Evaluation

Constructing rigorous evaluation suites and leaderboards for Document AI tasks, with a focus on Korean-language and multilingual settings.

🌐 Multilingual Document AI

Extending Document AI capabilities to Korean and other low-resource languages, addressing the unique challenges of mixed-script documents.

People

Group

Joongmin Shin · Group Lead · Founder
Multimodal AI Researcher · Human-Inspired AI Research

Gyuho Shim (심규호) · Co-Lead · Co-Founder
M.S. · Korea University

Myunghoon Kang (강명훈) · Member
Ph.D. Student · Korea University

Dongjun Kim · Member
AI Research Engineer · Upstage

Duong Tuan Thanh · Member
M.S. · Korea University
Selected Work

Publications

Document AI & Multimodal
International
ACL 2026 Main (Oral)
HiKEY: Hierarchical Multimodal Retrieval for Open-Domain Document Question Answering
Joongmin Shin, Gyuho Shim, Jeongbae Park, Jaehyung Seo, Heuiseok Lim
HiKEY introduces a hierarchical multimodal retrieval framework for open-domain document question answering, combining global routing, local ranking, and structured evidence assembly across long and visually complex documents.
ACL 2026 Main
EASE: Entity-Aware Sub-table Generation for Real-world Multi-table QA
Myunghoon Kang, Dahyun Jung, Suhyune Son, Seonmin Koo, Changwoo Chun, Yuna Hur
Proposes an entity-aware sub-table generation framework for question answering over real-world multi-table settings, addressing challenges of entity disambiguation and cross-table reasoning in structured document understanding.
CVPR 2026 Main
M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models
Joongmin Shin, Jeongbae Park, Jaehyung Seo, Heuiseok Lim
Introduces a dependency-aware chunking framework that leverages LVLMs to capture structural and semantic dependencies across multiple pages and documents for improved retrieval.
EMNLP 2025 Main
MultiDocFusion: Hierarchical and Multimodal Chunking Pipeline for Enhanced RAG on Long Industrial Documents
Joongmin Shin, Chanjun Park, Jeongbae Park, Jaehyung Seo, Heuiseok Lim
Proposes a hierarchical multimodal chunking pipeline that integrates text, layout, and visual signals to boost RAG performance on long, complex industrial documents.
ACL 2025 Industry (Oral)
REVISE: A Framework for Revising OCRed Text in Practical Information Systems with Data Contamination Strategy
Gyuho Shim, Seongtae Hong, Heuiseok Lim
Introduces a hierarchical taxonomy of OCR errors and a synthetic data contamination strategy that injects realistic OCR-like noise at the character, word, and structural level — enabling robust document reconstruction without costly real-error annotations.
EMNLP 2024 Industry
Intelligent Predictive Maintenance RAG Framework for Power Plants: Enhancing QA with StyleDFS and Domain-Specific Instruction Tuning
Seongtae Hong, Joongmin Shin, Jaehyung Seo, Taemin Lee, Jeongbae Park, Heuiseok Lim
Presents a domain-specific RAG system for power plant maintenance QA, combining StyleDFS retrieval and instruction-tuned generation for industrial deployment.
Domestic
HCLT 2024
초거대 언어모델을 활용한 교육 도메인의 구조적 정보 추출 (Structural Information Extraction in the Education Domain Using Large Language Models)
Myunghoon Kang
Proposes the task of extracting structured information in the education domain using large language models. An evaluation dataset is constructed for hierarchically extracting key terms from textbook paragraph text, and both open-source and proprietary LLMs are evaluated in an in-context learning setting.
Others
International
ACL 2026 Main (Oral)
LangSAE Editing: Improving Multilingual Information Retrieval via Post-hoc Language Identity Removal
Proposes a post-hoc language identity removal approach using sparse autoencoder editing to improve cross-lingual retrieval performance in multilingual language models.
ACL 2026 Findings
Exploring Coding Spot: Understanding Parametric Contributions to LLM Coding Performance
Dongjun Kim, Minhyuk Kim, Yong Chan Chun, Chanjun Park, Heuiseok Lim
Investigates which parameters within LLMs are responsible for coding capabilities through mechanistic analysis, offering insights for targeted fine-tuning and model understanding.
ACL 2026 Main
MMAC: A Multilingual, Multimodal Alignment Framework for Cultural Grounding Evaluation
Introduces a multilingual and multimodal alignment benchmark for evaluating cultural grounding in vision-language models across diverse linguistic and cultural contexts.
EMNLP 2025 Main (Oral)
Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks
Dongjun Kim, Gyuho Shim, Yong Chan Chun, Minhyuk Kim, Chanjun Park, Heuiseok Lim
Proposes a mechanistic diagnostic framework for LLM benchmark evaluation, identifying systematic failure modes and biases in standard benchmarks through profiling techniques.
EMNLP 2025 Findings
KoLEG: On-the-Fly Korean Legal Knowledge Editing with Continuous Retrieval
Jaehyung Seo, Dahyun Jung, Jaewook Lee, Yong Chan Chun, Dongjun Kim, Hwijung Ryu, Donghoon Shin, Heuiseok Lim
Introduces a continuous retrieval framework for on-the-fly Korean legal knowledge editing, enabling dynamic updates to LLM knowledge bases in domain-specific legal contexts.
ACL 2025 Findings
Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval
Yong Chan Chun, Minhyuk Kim, Dongjun Kim, Chanjun Park, Heuiseok Lim
Leverages syntactic retrieval with LLMs to improve automatic term extraction, achieving stronger precision and recall on domain-specific terminology identification tasks.
Systems & Applied AI

Projects

Document AI & Multimodal
2025.01 — Present · Korea University

DocGraph Copilot

Multimodal Chunking for Real-World Industrial Documents

Developed a novel multimodal chunking methodology for real-world industrial data. The method converts document layout information into a hierarchical structure and chunks along that structure, enabling more accurate and grounded retrieval-augmented generation over complex, long-form industrial documents.
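The core idea, turning flat layout blocks into a hierarchy and chunking along it, can be illustrated with a minimal sketch. This is not KUDoc's actual implementation; every class and function name below is hypothetical, and the input format (heading level plus text) is a deliberately simplified stand-in for real layout analysis:

```python
# Illustrative sketch only: hierarchical, layout-aware document chunking.
# All names here are hypothetical, not part of any published KUDoc code.
from dataclasses import dataclass, field

@dataclass
class Section:
    """A node in the document's layout hierarchy."""
    title: str
    level: int                                       # 1 = top-level heading
    text: list[str] = field(default_factory=list)    # body text in this section
    children: list["Section"] = field(default_factory=list)

def build_hierarchy(blocks: list[tuple[int, str]]) -> Section:
    """Fold a flat list of (heading_level, text) layout blocks into a tree.
    Level 0 marks body text attached to the current section."""
    root = Section("root", 0)
    stack = [root]
    for level, text in blocks:
        if level == 0:                               # body text
            stack[-1].text.append(text)
        else:                                        # heading: pop up to its parent
            while stack[-1].level >= level:
                stack.pop()
            node = Section(text, level)
            stack[-1].children.append(node)
            stack.append(node)
    return root

def chunk(node: Section, path: tuple[str, ...] = ()) -> list[str]:
    """Emit one chunk per section, prefixed with its heading path so the
    retriever sees structural context, not just the local text."""
    chunks = []
    here = path + ((node.title,) if node.title != "root" else ())
    if node.text:
        chunks.append(" > ".join(here) + ": " + " ".join(node.text))
    for child in node.children:
        chunks.extend(chunk(child, here))
    return chunks

blocks = [(1, "Maintenance Manual"), (0, "General safety rules."),
          (2, "Turbine"), (0, "Inspect blades monthly.")]
print(chunk(build_hierarchy(blocks)))
```

Prefixing each chunk with its heading path is one simple way to keep hierarchical context visible to the retriever; the actual system presumably fuses far richer signals (tables, figures, visual layout) across multiple modalities.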

2024.03 — Present · Government-funded

Synapse

Development of an LLM-Based Multilingual Consultation Service

Synapse is a multilingual consultation system powered by large language models, designed to deliver accurate, context-aware guidance across language barriers. The project focuses on robust multilingual understanding, domain-specific instruction tuning, and safe deployment of LLMs in real-world advisory contexts.

Based on: StyleDFS · REVISE
Others
2023.09 — 2024.08 · Government-funded

AI-based Thought Structuring Learning Tutor

Development of AI-based Thought Structuring Learning Tutoring Technology for Improving Literacy Competency

Developed a structured summarization evaluation dataset and framework based on textbook document parsing, designed to score and diagnose learners' structured learning outputs. This work establishes a core foundation for AI-driven tutoring systems in the educational domain.

NC AI Consortium · WBL · ETRI

VAETKI

Large Language Model — NC AI Consortium (13 Organizations)

VAETKI is a large language model developed by the NC AI consortium, a collaborative initiative led by NC AI with participation from 13 organizations including WBL and ETRI. Designed with scalability and efficiency as primary goals, VAETKI adopts a Mixture-of-Experts (MoE) architecture to balance performance and computational cost. Key features include tool-agent tasks in non-thinking mode, human-preference alignment for accurate instruction following, and multilingual support across English, Korean, Chinese, and Japanese.

Network

Collaborating Institutions

KUDoc is anchored at Korea University and works in close research collaboration with institutions across academia and industry in Korea, Canada, and Singapore.

Academic Institutions
Korea University (Anchor Institution) · South Korea
Konkuk University · South Korea
Soongsil University · South Korea
York University · Canada
SUTD (Singapore University of Technology and Design) · Singapore

Industry & Research Collaborators
Microsoft Asia · Corporate Research & Technology
Upstage · AI Company · South Korea
Get in touch

Contact

Get Involved with KUDoc
We are always looking for collaborators! If you're interested in joining our efforts, drop us a line at: