2008-03-14, 02:43 AM
I haven't seen anybody discuss extracting closed captions from OTA HDTV (USA) channels, so I thought I would share my findings. I also need help with my postprocessing.bat file, but more on that later.
So, here is what is working for me:
I record only HD channels on my HVR-1600 and save them in dvr-ms format.
I run ccextractor v0.34 (http://ccextractor.sourceforge.net/) with the following command in postprocessing.bat:
ccextractor -srt %1
I have turned on "Enable SRT viewer" in GB-PVR Configuration, Misc screen, so I can toggle CC on and off with the yellow button on the Hauppauge remote or Ctrl-Y on the keyboard.
The results are inconsistent: sometimes the CC is perfect, and sometimes it contains so many extra characters that it is difficult to read. I haven't played with it long enough to know what that depends on. Either way, it is helpful for me since English is my second language.
Before using ccextractor, I tried mpg2srt, unsuccessfully.
- mpg2srt does not process dvr-ms files.
- mpg2srt produced empty *.srt and *.sami files from my *.mpg recordings.
So, for HDTV, ccextractor looks like the only viable solution I have found so far.
I am thinking about adding this to the wiki once I am more confident in the solution.
=============================
Now comes the request for help:
If I ask ccextractor to process the file test.dvr-ms, it produces test_1.srt, which I have to rename manually to test.srt.
What command can I add to postprocessing.bat to remove the _1 from the file name?
I looked for a suitable option in ccextractor and emailed the program's author, but have had no response so far.
Can you guys help?
Thanks.
The syntax of ccextractwin is:
CCExtractor v0.34, cfsmp3 at gmail
----------------------------------
Heavily based on McPoodle's tools. Check his page for lots of information
on closed captions technical details.
(http://www.geocities.com/mcpoodle43/SCC_...TOOLS.HTML)
This tool home page:
http://ccextractor.sourceforge.net
Extracts closed captions from MPEG files.
(DVB, .TS, ReplayTV 4000 and 5000, dvr-ms, bttv and Dish Network are known
to work).
Syntax:
ccextractor [options] inputfile1 [inputfile2...] [-o outputfilename]
[-o1 outputfilename1] [-o2 outputfilename2]
File name related options:
inputfile: file(s) to process
-o outputfilename: Use -o parameters to define output filename if you don't
like the default ones (same as infile plus _1 or _2 when
needed and .bin or .srt extension).
-o or -o1 -> Name of the first (maybe only) output
file.
-o2 -> Name of the second output file, when
it applies.
-cf filename: Write 'clean' data to a file. Cleans means the ES
without TS or PES headers.
You can pass as many input files as you need. They will be processed in order.
Output will be one single file (either raw or srt). Use this if you made your
recording in several cuts (to skip commercials for example) but you want one
subtitle file with contiguous timing.
Options that affect what will be processed:
-1, -2, -12: Output Field 1 data, Field 2 data, or both
(DEFAULT is -1)
-cc2: When in srt/sami mode, process captions in channel 2
instead channel 1.
In general, if you want English subtitles you don't need to use these options
as they are broadcast in field 1, channel 1. If you want the second language
(usually Spanish) you may need to try -2, or -cc2, or both.
Options that affect how input files will be processed.
-ts: Force Transport Stream mode.
-nots: Disable Transport Stream mode.
-bin: Process a raw (bin) closed captions dump instead of a
MPEG files. Requires that either -srt or -sami is used
as well.
-myth: Force MythTV code branch.
-nomyth: Disable MythTV code branch.
-fp --fixpadding: Fix padding - some cards (or providers, or whatever)
seem to send 0000 as CC padding instead of 8080. If you
get bad timing, this might solve it.
Usually you only need to use -bin (if you want to produce srt/sami from a
dump of previously extracted closed captions). For MPEG files, transport
stream mode is autodetected. The MythTV branch is needed for analog captures
such as those with bttv cards (Hauppage 250 for example), which is detected
as well. You can however force whatever you need in case autodetection
doesn't work for you.
Options that affect what kind of output will be produced:
-d: Output raw captions in DVD format
(DEFAULT is broadcast format)
-srt: Generate .srt instead of .bin.
-sami: Generate .sami instead of .bin.
-utf8: Encode subtitles in UTF-8 instead of Latin-1
-unicode: Encode subtitles in Unicode instead of Latin-1
-nofc --nofontcolor: For .srt/.sami, don't add font color tags.
-sc --sentencecap: Sentence capitalization. Use if you hate.
ALL CAPS in subtitles.
--capfile -caf file: Add the contents of 'file' to the list of words
that must be capitalized. For example, if file
is a plain text file that contains
Tony
Alan
Whenever those words are found they will be written
exactly as they appear in the file.
Use one line per word. Lines starting with # are
considered comments and discarded.
Options that affect how ccextractor reads and writes (buffering):
-bo -bufferoutput: Buffer writes. Might help a bit with performance.
-bi -bufferinput: Forces input buffering.
-nobi -nobufferinput: Disables input buffering.
Options that affect the built-in closed caption decoder:
-dru: Direct Roll-Up. When in roll-up mode, write character by
character instead of line by line. Note that this
produces (much) larger files.
-noff: Disable FF clean-up. This is extra sanity check when
processing CC blocks. FF clean-up usually gets rid of
garbage produced by false CC block, but might cause
good characters to be missed. Use this option if you
prefer not to have any character discarded. Note that
this option is probably no longer needed and will
be removed soon.
Options that affect timing:
-noap --noautopad: Disable autopad. By default ccextractor pads closed
captions data to ensure that there's exactly 29.97 CC
2-byte blocks per second. Usually this fixes timing
issues, but you may disable it with this option.
Note that autopadding only happens in TS mode.
-gp --goppad: Use GOP timing for padding instead of PTS. Use this
if you need padding on a non-TS file.
-delay ms: For srt/sami, add this number of milliseconds to
all times. For example, -delay 400 makes subtitles
appear 400ms late. You can also use negative numbers
to make subs appear early.
Notes on times: -startat and -endat times are used first, then -delay.
So if you use -srt -startat 3:00 -endat 5:00 -delay 120000, ccextractor will
generate a .srt file, with only data from 3:00 to 5:00 in the input file(s)
and then add that (huge) delay, which would make the final file start at
5:00 and end at 7:00.
Options that affect what segment of the input file(s) to process:
-startat time: For .srt/.sami, only write subtitles that start after
the given time. Time can be seconds, MM:SS or HH:MM:SS.
For example, -startat 3:00 means 'start writing from
minute 3.
This option is ignored in raw mode.
-endat time: Stop processing after the given time (same format as
-startat). This option is honored in all output
formats.
-scr --screenfuls num: Write 'num' screenfuls and terminate processing.
Options that affect debug data:
-debug: For HDTV dumps 'interesting' packets.
-608: Print debug traces from the EIA-608 decoder.
If you need to submit a bug report, please send
the output from this option.
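Going by the help text above, the -o option looks like the way to control the output file name. Here is a minimal, untested sketch of what postprocessing.bat could look like. It assumes GB-PVR passes the full path of the recording as %1 (as in my current command) and that v0.34 honors -o the way the help describes. The last line is only a fallback rename in case the _1 file still shows up.

```bat
@echo off
rem postprocessing.bat (sketch): assumes GB-PVR passes the recording path as %1.
rem %~1    = the argument with any surrounding quotes removed
rem %~dpn1 = drive + path + base name of %1, without the extension,
rem          so C:\rec\test.dvr-ms becomes C:\rec\test
ccextractor -srt "%~1" -o "%~dpn1.srt"

rem Fallback (assumption: the default name is <basename>_1.srt, as I observed):
rem if the _1 file was written anyway, rename it over the expected name.
if exist "%~dpn1_1.srt" move /Y "%~dpn1_1.srt" "%~dpn1.srt"
```

The quotes are there so that recordings with spaces in the path still work. If -o does what the help says, the fallback line never fires and can be dropped.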
[SIZE="2"]GBPVR 1.3.11 on WinXP SP2; ATSC OTA.
Core 2 Duo 2.2GHz; 2GB RAM; NVIDIA 8500GT 256MB; Hauppauge HVR-1600 and Pinnacle HD Pro, 720p HDTV;[/SIZE]